Foreword


It has been a privilege of my career to be involved in many information retrieval 
evaluation campaigns. For a range of reasons, my involvement with NTCIR has 
been the longest and the most enjoyable. 

Principal among the reasons is Noriko Kando. Evaluation campaigns are sus- 
tained by the unyielding enthusiasm of a core organizer. Kando-san has devotedly 
innovated this campaign from its very start. Thanks to her, what you will consis- 
tently see in the chapters of this book is a sequence of tasks or tracks that were 
ahead of their time. NTCIR was the first to explore patent search, first to incorporate 
life logging data, and first to examine retrieval of mathematical formulas. 

NTCIR has innovated in the methodologies used to measure a shared task: the 
visualization and summarization tasks, for example, require a quite different 
approach to evaluation than you would see in many other campaigns. Thanks to 
Sakai-san’s diligent creation, many of the chapters will describe assessment with 
novel measures. NTCIR was the first to use graded evaluation. 

Diversity of excellent evaluation research is what I know I will see when I attend 
NTCIR at the NII building in Chiyoda-ku. Such a range of innovations can only come 
from a team of outstanding collaborators: you can see from the diversity of chapter 
authors just how many have contributed their ideas and hard work to NTCIR. 

Dedication to quality is another reason for my regular visits to Tokyo. Such is the 
commitment of Kando-san and her team that on the evening marking the end of each 
NTCIR conference (a time when normal organizers just want to sleep) the team meet 
up in the NII Tower to discuss what worked well, what didn’t, and how to improve. 
Oard-san’s thoughtful advice is often to be found there. At that meeting less than a 
few hours after NTCIR has completed, the next campaign is being planned. 




The work described in this book charts the progression of the academic field of 
information retrieval research from a rather limited library focused research topic to 
a rich multi-faceted study of information access of all forms of content. It has been 
my honor to be a part of this campaign and I look forward to what rich new topics it 
will tackle in the future, at its sesquiennial pace. 


Melbourne, Australia Mark Sanderson 


Preface 


The NTCIR-1 Conference took place in 1999. Back then, NTCIR stood for NACSIS 
Test Collection for Information Retrieval systems. Ever since, NTCIR has grown in 
size, broadened its scope, and evolved; now we know it as NII Testbeds and 
Community for Information access Research. We editors of this book would like to 
thank everyone who has been involved in NTCIR in the past two decades or so, and 
in particular the following people, for making this book happen. 


The chapter authors: Akiko Aizawa, Rami Albatal, Kuang-Hua Chen, Duc-Tien 
Dang-Nguyen, Zhicheng Dou, Atsushi Fujii, Takahiro Fukushima, Isao Goto, 
Cathal Gurrin, Graham Healy, Tsutomu Hirao, Frank Hopfgartner, Makoto 
Iwayama, Hideo Joho, Noriko Kando, Makoto P. Kato, Tsuneaki Kato, Kazuaki 
Kishida, Michael Kohlhase, Yiqun Liu, Cheng Luo, Teruko Mitamura, 
Hidetsugu Nanba, Eric Nyberg, Douglas W. Oard, Manabu Okumura, Tetsuya 
Sakai, Mark Sanderson, Yohei Seki, Ruihua Song, Masaharu Yoshioka, Min 
Zhang, and Liting Zhou; 

The chapter reviewers: Martin Braschler, Wolfgang Hurst, Nattiya Kanhabua, 
Stefano Mizzaro, Tatsunori Mori, Ian Soboroff, Damiano Spina, Takehiro 
Yamamoto, and Richard Zanibbi; 

Those who offered constructive comments on the early drafts of the chapters that 
were publicly available online; 

Past and present NTCIR general chairs, PC chairs, EVIA chairs, organizing 
committee members, and staff; 

Past and present NTCIR task organizers and participants; and last but not least,
Springer’s Mio Sugino for her support and perseverance.




This is the first book on NTCIR. A copy of it will be given to all NTCIR-15 
participants in December 2020. It has been a long journey, but the journey con- 
tinues. Stay safe and healthy. 


Tokyo, Japan Tetsuya Sakai 
College Park, MD, USA Douglas W. Oard 
Tokyo, Japan Noriko Kando 


April 2020 
Chapter 1
Graded Relevance


Tetsuya Sakai 


Abstract NTCIR was the first large-scale IR evaluation conference series to con- 
struct test collections with graded relevance assessments: the NTCIR-1 test collec- 
tions from 1998 already featured relevant and partially relevant documents. In this 
chapter, I provide a survey on the use of graded relevance assessments and of graded 
relevance measures in the past NTCIR tasks which primarily tackled ranked retrieval. 
My survey shows that the majority of the past tasks fully utilised graded relevance 
by means of graded evaluation measures, but not all of them; interestingly, even a 
few relatively recent tasks chose to adhere to binary relevance measures. I conclude 
the chapter with a summary of my survey in table form.


1.1 Introduction 


The evolution of NTCIR is quite different from that of TREC when it comes to 
how relevance assessments have been collected and utilised. In 1992, TREC started 
off with a high-recall task (i.e., the adhoc track), with binary relevance assess- 
ments (Harman 2005). Moreover, early TREC tracks heavily relied on evaluation 
measures based on binary relevance such as 11-point Average Precision, R-precision,
and (noninterpolated) Average Precision. It was in the TREC 2000 (a.k.a. TREC-9)
Main Web task that 3-point graded relevance assessments were introduced, based
on feedback from web search engine companies at that time (Hawking and Craswell
2005, p. 204). Accordingly, this task also adopted Discounted Cumulative Gain (DCG)
(Järvelin and Kekäläinen 2000), to utilise the graded relevance assessments.
NTCIR has collected graded relevance assessments from the very beginning: the 
NTCIR-1 test collections from 1998 already featured relevant and partially rele- 
vant documents (Kando et al. 1999). Thus, while NTCIR borrowed many ideas from 
TREC when it was launched in the late 1990s, its policy regarding relevance assessments
seems to have followed the paths of Cranfield II (which had 5-point relevance
levels) (Cleverdon et al. 1966, p. 21), Oregon Health Sciences University’s MEDLINE
Data Collection (OHSUMED) (which had 3-point relevance levels) (Hersh
et al. 1994), as well as the first Japanese IR test collections BMIR-J1 and BMIR-J2
(which also had 3-point relevance levels) (Sakai et al. 1999).


T. Sakai
Waseda University, Shinjuku-ku Okubo 3-4-1 63-05-04, Tokyo 169-8555, Japan

Interestingly, with perhaps a notable exception of the aforementioned TREC 2000 
Main Web Task, it is true for both TREC and NTCIR that the introduction of graded 
relevance assessments did not necessarily mean immediate adoption of evaluation 
measures that can utilise graded relevance. For example, while the TREC 2003-2005 
robust tracks constructed adhoc IR test collections with 3-point graded relevance 
assessments, they adhered to binary relevance measures such as Average Precision 
(AP). Similarly, as I shall discuss in this chapter,¹ while almost all of the past IR
tasks of NTCIR had graded relevance assessments, not all of them fully utilised
them by means of graded relevance measures. This is the case despite the fact that a
graded relevance measure called the normalised sliding ratio (NSR)² was proposed
in 1968 (Pollock 1968), and was discussed in a 1997 book by Korfhage along with
another graded relevance measure (Korfhage 1997, p. 209).


1.2 Graded Relevance Assessments, Binary Relevance 
Measures 


This section provides an overview of NTCIR ranked retrieval tasks that did not 
use graded relevance evaluation measures even though they had graded relevance 
assessments. 


1.2.1 Early IR and CLIR Tasks (NTCIR-1 Through -5) 


The Japanese IR and (Japanese-English) crosslingual tasks of NTCIR-1 (Kando et al. 
1999) constructed test collections with 3-point relevance levels, but used binary 
relevance measures such as AP and R-precision by either treating the relevant and 
partially relevant documents as “relevant” or treating only the relevant documents as 
“relevant.” However, it should be stressed at this point that using binary relevance 
measures with different relevance thresholds cannot serve as substitutes for a graded 
relevance measure that enables optimisation towards an ideal ranked list (i.e., a list 
of documents sorted in decreasing order of relevance levels). If partially relevant 


¹A 31-page, March 2019 version of this chapter is available on arxiv.org (Sakai 2019). The arxiv
version contains the definitions of the main graded relevance measures used at NTCIR, as well as
details on how graded relevance levels were constructed from individual assessors’ judgements for
some of the tasks.


²NSR is actually what is now known as normalised (nondiscounted) cumulative gain (nCG): see
Sakai (2019).
documents are ignored, a Search Engine Result Page (SERP) whose top l documents
are all partially relevant and one whose top l documents are all nonrelevant can
never be distinguished from each other; if relevant documents and partially relevant
documents are all treated as relevant, a SERP whose top l documents are all relevant
and one whose top l documents are all partially relevant can never be distinguished
from each other. 

The Japanese and English (monolingual and crosslingual) IR tasks of NTCIR- 
2 (Kando et al. 2001) constructed test collections with 4-point relevance levels. 
However, the organisers used binary relevance measures such as AP and R-precision 
with two different relevance thresholds. As for the Chinese monolingual and Chinese- 
English IR tasks of NTCIR-2 (Chen and Chen 2001), three judges independently 
judged each pooled document using 4-point relevance levels, and then a score was 
assigned to each relevance level. Finally, the scores were averaged across the three 
assessors. The organisers then applied two different thresholds to map the scores to 
binary rigid relevance and relaxed relevance data. For evaluating the runs, rigid and 
relaxed versions of recall-precision curves (RP curves) were used. 
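
As a rough sketch of this averaging-and-thresholding procedure: the per-level scores (3, 2, 1, 0) and both threshold values below are hypothetical placeholders chosen for illustration, not the values used by the task organisers.

```python
# Hypothetical score for each of the four relevance levels (highest to lowest),
# and hypothetical thresholds; the actual task values are not given here.
RIGID_THRESHOLD = 2.0
RELAXED_THRESHOLD = 1.0

def binary_judgements(assessor_scores):
    """Average the per-assessor scores, then threshold the mean twice."""
    mean = sum(assessor_scores) / len(assessor_scores)
    return {
        "rigid": mean >= RIGID_THRESHOLD,
        "relaxed": mean >= RELAXED_THRESHOLD,
    }

# Three assessors per pooled document.
doc_a = binary_judgements([3, 2, 1])   # mean 2.0
doc_b = binary_judgements([2, 1, 1])   # mean about 1.33
doc_c = binary_judgements([1, 0, 0])   # mean about 0.33
```

Under these assumed values, doc_a is relevant in both data sets, doc_b only in the relaxed one, and doc_c in neither.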

The NTCIR-3 CLIR (Cross-Language IR) task (Chen et al. 2002) was similar 
to the previous IR tasks: 4-point relevance levels were used, and two relevance 
thresholds were used. Finally, rigid and relaxed versions of AP were computed for 
each run. The NTCIR-4 and NTCIR-5 CLIR tasks (Kishida et al. 2004, 2005) adhered
to the above practice. 

All of the above tasks used the trec_eval program from TREC to compute 
binary relevance measures such as AP. 


1.2.2 Patent (NTCIR-3 Through -6)


The NTCIR-3 Patent Retrieval task (Iwayama et al. 2003) was a news-to-patent 
technical survey search task, with 4-point relevance levels. RP curves were drawn 
based on strict relevance and relaxed relevance. 

The main task of the NTCIR-4 Patent Retrieval task (Fujii et al. 2004) was a patent- 
to-patent invalidity search task. There were two types of relevant documents: A (a 
patent that can invalidate a given claim on its own) and B (a patent that can invalidate 
a given claim only when used with one or more other patents). For example, patents 
B1 and B2 may each be nonrelevant (as they cannot invalidate a claim individually),
but if they are both retrieved, the pair should serve as one relevant document. At
the evaluation step, rigid and relaxed APs were computed. Note that the above
relaxed evaluation has a limitation: recall the aforementioned example with B1 and
B2, and consider a SERP that managed to return only one of them (say B1). Relaxed
evaluation rewards the system for returning B1, even though this document alone
does not invalidate the claim. 

The Document Retrieval subtask of the NTCIR-5 Patent Retrieval task (Fujii et al. 
2005) was similar to its predecessor, but the relevant documents were determined 
purely based on whether and how they were actually used by a patent examiner to 
reject a patent application; no manual relevance assessments were conducted for this
subtask. The graded relevance levels were defined as follows: A (a citation that was 
actually used on its own to reject a given patent application) and B (a citation that 
was actually used along with another one to reject a given patent application). As 
for the evaluation measure for Document Ranking, the organisers adhered to rigid 
and relaxed APs. In addition, the task organisers introduced a Passage Retrieval 
subtask by leveraging passage-level binary relevance assessments collected as in the 
NTCIR-4 Patent task: given a patent, systems were required to rank the passages from 
that same patent. As both single passages and groups of passages can potentially be 
relevant to the source patent (i.e., the passage(s) can serve as evidence to determine 
that the entire patent is relevant to a given claim), this poses a problem similar 
to the one discussed above with patents B1 and B2: for example, if two passages
p1, p2 are relevant as a group but not individually, and if p1 is ranked at i and p2 is
ranked at i’ (> i), how should the SERP of passages be evaluated? To address this,
the task organisers introduced a binary relevance measure called the Combinational
Relevance Score (CRS), which assumes that the user who scans the SERP must reach
as far as i’ to view both p1 and p2.³

The Japanese Document Retrieval subtask of the NTCIR-6 Patent Retrieval 
task (Fujii et al. 2007) had two different sets of graded relevance assessments; the first 
set (“Def0” with A and B documents) was defined in the same way as in NTCIR-5; the
second set (“Def1”) was automatically derived from Def0 based on the International 
Patent Classification (IPC) codes as follows: H (the set of IPC subclasses for this 
cited patent has no overlap with that of the input patent), A (the set of IPC subclasses 
for this cited patent has some overlap with that of the input patent), and B (the set of 
IPC subclasses for this cited patent is identical to that of the input patent). As for the
English Document Retrieval subtask, the relevance levels were also automatically 
determined based on IPC codes, but only two types of relevant documents (A and B) 
were identified, as each USPTO patent is given only one IPC code. In both subtasks, 
AP was computed by considering different combinations of the above relevance 
levels. 


1.2.3 SpokenDoc/SpokenQuery&Doc
(NTCIR-9 Through -12)


The Spoken Document Retrieval (SDR) subtask of the NTCIR-9 SpokenDoc task 
(Akiba et al. 2011) had two “subsubtasks”: Lecture Retrieval and Passage Retrieval, 
where a passage is any sequence of consecutive inter-pausal units. Passage-level 
relevance assessments were obtained on a 3-point scale, and it appears that the lecture- 


³In fact, AP, Q or any measure from the NCU family (Sakai and Robertson 2008) can easily be
extended to handle combinational relevance for Document Retrieval (see the above example with
(B1, B2)) and for Passage Retrieval (see the above example with (p1, p2)): see Sakai (2019).
level (binary) relevance was deduced from them.⁴ AP was used for evaluating Lecture
Retrieval, whereas variants of AP, called utterance-based (M)AP, pointwise (M)AP, 
and fractional (M)AP were used for evaluating Passage Retrieval. These are all binary 
relevance measures. The NTCIR-10 SpokenDoc-2 Spoken Content Retrieval (SCR) 
subtask (Akiba et al. 2013) was similar to the SDR subtask at NTCIR-9, with Lecture 
Retrieval and Passage Retrieval subsubtasks. Lecture Retrieval used a revised version 
of the NTCIR-9 SpokenDoc topic set, and its gold data does not contain graded 
relevance assessments⁵; binary relevance AP was used for the evaluation. As for
Passage Retrieval, a new topic set was devised, again with 3-point relevance levels.⁶
The AP variants from the NTCIR-9 SDR task were used for the evaluation again. 
The Slide Group Segment (SGS) Retrieval subsubtask of the NTCIR-11 SpokenQuery&Doc
SCR subtask involved the ranking of predefined retrieval units (i.e.,
SGSs), unlike the Passage Retrieval subsubtask which allows any sequence of con- 
secutive inter-pausal units as a retrieval unit. Three-point relevance levels were used 
to judge the SGSs: R (relevant), P (partially relevant), and I (nonrelevant). However, 
binary AP was used for the evaluation after collapsing the grades to binary. As for 
the passage-level relevance assessments, they were derived from the SGSs labelled 
R or P, and were considered binary; the three AP variants were used for this subsub- 
task again. Segment Retrieval was continued at the NTCIR-12 SpokenQuery&Doc-2 
task, again with the same 3-point relevance levels and AP as the evaluation measure. 


1.2.4 Math/MathIR (NTCIR-10 Through -12)


In the Math Retrieval subtask of the NTCIR-10 Math Task, retrieved mathematical 
formulae were judged on a 3-point scale. Up to two assessors judged each formula, 
and initially 5-point relevance scores were devised based on the results. For example, 
for formulae judged by one assessor, they were given 4 points if the judged label 
was relevant; for those judged by two assessors, they were given 4 points if both 
of them gave them the relevant label. Finally, the scores were mapped to a 3-point 
scale: Documents with scores 4 or 3 were treated as relevant; those with 2 or 1 
were treated as partially relevant; those with 0 were treated as nonrelevant. However,
at the evaluation step, only binary relevance measures such as AP and Precision 
were computed using trec_eval, after collapsing the grades to binary. Similarly, 
in the Math Retrieval subtask of the NTCIR-11 Math Task (Aizawa et al. 2014), 
two assessors independently judged each retrieved unit on a 3-point scale, and the 


⁴The official test collection data of the NTCIR-9 SDR task (evalsdr) contains only passage-level
gold data.

⁵This was verified by examining SpokenDoc2-formalrun-SCR-LECTURE-golden-20130129.xml
in the SpokenDoc-2 test collection http://research.nii.ac.jp/ntcir/permission/ntcir-10/perm-en-SPOKENDOC.html.

⁶This was verified by examining SpokenDoc2-formalrun-SCR-PASSAGE-golden-20130215.xml
in the SpokenDoc-2 test collection http://research.nii.ac.jp/ntcir/permission/ntcir-10/perm-en-SPOKENDOC.html.
final relevance levels were also on a 3-point scale. If the two assessor labels were
relevant/relevant or relevant/partially relevant, the final grade was relevant; if the two 
labels were both nonrelevant, the final grade was nonrelevant; the other combinations 
were considered partially relevant. As for the evaluation measures, bpref (Buckley 
and Voorhees 2004; Sakai 2007; Sakai and Kando 2008) was computed along with 
AP and Precision using trec_eval. 
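
The two consolidation rules described in this section can be sketched as follows. The function and constant names are hypothetical; only the mapping rules themselves are taken from the text.

```python
RELEVANT, PARTIAL, NONRELEVANT = "relevant", "partially relevant", "nonrelevant"

def grade_from_score(score):
    """NTCIR-10 Math rule: map a 0..4 relevance score to a 3-point grade."""
    if score >= 3:
        return RELEVANT
    if score >= 1:
        return PARTIAL
    return NONRELEVANT

def consolidate(label_a, label_b):
    """NTCIR-11 Math rule: combine two assessor labels into one final grade."""
    pair = {label_a, label_b}  # order of the two assessors does not matter
    if pair == {RELEVANT} or pair == {RELEVANT, PARTIAL}:
        return RELEVANT
    if pair == {NONRELEVANT}:
        return NONRELEVANT
    return PARTIAL  # every other combination
```

For instance, consolidate(RELEVANT, NONRELEVANT) falls through to the last branch and yields a partially relevant final grade.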

The NTCIR-12 MathIR task was similar to the Math Retrieval subtask of the 
aforementioned Math tasks. Up to four assessors judged each retrieved unit using a 
3-point scale, and the individual labels were consolidated to form the final 3-point 
scale assessments. As for the evaluation, only Precision was computed at several 
cutoffs using trec_eval. 

The NTCIR-11 Math (Aizawa et al. 2014) and NTCIR-12 MathIR (Zanibbi et al. 
2016) overview papers suggest that one reason for adhering to binary relevance 
measures is that trec_eval could not handle graded relevance. On the other hand, 
this may not be the only reason: in the MathIR overview paper, it is reported that the 
organisers chose Precision because it is “simple to understand” (Zanibbi et al. 2016). 
Thus, some researchers indeed choose to focus on evaluation with binary relevance 
measures, even in the NTCIR community where we have graded relevance data by 
default and a tool for computing graded relevance measures is known.⁷


1.3 Graded Relevance Assessments, Graded Relevance 
Measures 


This section provides an overview of NTCIR ranked retrieval tasks that employed 
graded relevance evaluation measures to fully enjoy the benefit of having graded 
relevance assessments. 


1.3.1 Web (NTCIR-3 Through -5)


The NTCIR-3 Web Retrieval task (Eguchi et al. 2003) was the first NTCIR task to use 
a graded relevance evaluation measure, namely, DCG.⁸ Four-point relevance levels
were used. In addition, assessors chose a very small number of “best” documents 
from the pools. To compute DCG, two different gain value settings were used: Rigid 
(3 for highly relevant, 2 for fairly relevant, 0 otherwise) and Relaxed (3 for highly 


⁷NTCIREVAL has been available on the NTCIR website since 2010; its predecessor ir4qa_eval
was released in 2008 (Sakai et al. 2008). Note also that TREC 2010 released
https://trec.nist.gov/data/web/10/gdeval.pl for computing Normalised Discounted Cumulative Gain (nDCG) and
Expected Reciprocal Rank (ERR).

⁸This was the DCG as originally defined by Järvelin and Kekäläinen (2000) with the logarithm base
b = 2, which means that gain discounting is not applied to documents at ranks 1 and 2. See also
Sect. 1.3.3.
relevant, 2 for fairly relevant, 1 for partially relevant, 0 otherwise). The organisers
of the Web Retrieval task also defined a graded relevance evaluation measure called 
Weighted Reciprocal Rank (WRR), designed for navigational searches. However, 
what was actually used in the task was the binary relevance Reciprocal Rank (RR), 
with two different relevance thresholds. Therefore, this measure will be denoted 
“(W)RR” hereafter whenever graded relevance is not utilised. Other binary relevance 
measures including AP and R-precision were also used in this task. For a comparison 
of evaluation measures designed for navigational intents including RR, WRR, and 
P+, see Sakai (2007). 

The NTCIR-4 WEB Informational Retrieval Task (Eguchi et al. 2004) was similar 
to its predecessor, with 4-point relevance levels; the evaluation measures were DCG, 
(W)RR, Precision, etc. On the other hand, the NTCIR-4 WEB Navigational Retrieval 
Task (Oyama et al. 2004) used 3-point relevance levels: A (relevant), B (partially
relevant), and D (nonrelevant); the evaluation measures were DCG and (W)RR, and
two gain value settings for DCG were used: (A, B, D) = (3, 0, 0) and (A, B, D) =
(3, 2, 0).

The NTCIR-5 WEB task ran the Navigational Retrieval subtask, which is basically 
the same as its predecessor, with 3-point relevance levels and DCG and (W)RR as 
the evaluation measures. For computing DCG, three gain value settings were used: 
(A, B, D) = (3, 0, 0), (A, B, D) = (3, 2, 0), and (A, B, D) = (3, 3, 0). Note that
the first and the third settings reduce DCG to binary relevance measures. 
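
The effect of these gain value settings can be sketched with the original DCG definition, in which ranks below the logarithm base b are not discounted and every rank r from b onwards is discounted by log_b(r). The ranked list of labels below is hypothetical.

```python
import math

def dcg(labels, gains, b=2):
    """Original DCG: no discount for ranks r < b, gain / log_b(r) for r >= b."""
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        gain = gains[label]
        total += gain if rank < b else gain / math.log(rank, b)
    return total

serp = ["B", "A", "D", "B"]  # hypothetical ranked list of relevance labels
settings = {
    "first": {"A": 3, "B": 0, "D": 0},
    "second": {"A": 3, "B": 2, "D": 0},
    "third": {"A": 3, "B": 3, "D": 0},
}
scores = {name: dcg(serp, gains) for name, gains in settings.items()}

# With b = 2, swapping the documents at ranks 1 and 2 leaves DCG unchanged.
swap_1 = dcg(["A", "D"], settings["second"])
swap_2 = dcg(["D", "A"], settings["second"])
```

In the first and third settings every nonzero gain equals 3, so the score is simply 3 times a DCG computed over binary relevance; only the second setting distinguishes A from B.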


1.3.2 CLIR (NTCIR-6) 


At the NTCIR-6 CLIR task, 4-point relevance levels (S,A,B,C) were used and rigid 
and relaxed AP scores were computed using trec_eval as before. In addition, 
the organisers computed “as a trial” (Kishida et al. 2007) the following graded relevance
measures using their own script: nDCG (as defined originally by Järvelin
and Kekäläinen 2002), Q-measure (Sakai 2014; Sakai and Zeng 2019) (or “Q”),
and Kishida’s generalised AP (Kishida 2005). See Sakai (2007) for a compari- 
son of these three graded relevance measures. The CLIR organisers developed a 
program to compute these graded relevance measures, with the gain value setting: 
(S, A, B, C) = (3, 2, 1, 0).


1.3.3 ACLIA IR4QA (NTCIR-7 and -8) 


At the NTCIR-7 Information Retrieval for Question Answering (IR4QA) task (Sakai 
et al. 2008), a predecessor of NTCIREVAL called ir4qa_eval was released (See
Sect. 1.2.4). This tool was used to compute the Q-measure, the “Microsoft ver- 
sion” of nDCG (Sakai 2014), as well as the binary relevance AP. Microsoft nDCG 
(called MSnDCG in NTCIREVAL) fixes a problem with the original nDCG (See also 
Sect. 1.3.1): in the original nDCG, if the logarithm base is set to (say) b = 10, then
discounting is not applied from ranks 1 to 10. Hence, the ranks of the relevant documents
within top 10 do not matter. Microsoft nDCG avoids this problem by using
1/log(1 + r) as the discount factor for every rank r, but thereby loses the patience
parameter b (Sakai 2014).⁹ The relevance levels used were L2, L1, and L0. A linear
gain value setting was used: (L2, L1, L0) = (2, 1, 0). The NTCIR-8 IR4QA
task (Sakai et al. 2010) used the same evaluation methodology as above. 
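
The difference between the two discounting schemes can be sketched as follows. This is an illustration only, not the ir4qa_eval or NTCIREVAL implementation: the logarithm base 2 in the Microsoft variant is an assumption (the base cancels out after normalisation), and the ideal list is built from the ranked list itself, which assumes that the list contains every relevant document.

```python
import math

def dcg_original(gains, b):
    """Original DCG: no discount for ranks r < b, gain / log_b(r) otherwise."""
    return sum(g if r < b else g / math.log(r, b)
               for r, g in enumerate(gains, start=1))

def dcg_microsoft(gains):
    """Microsoft version: every rank r is discounted by log(1 + r)."""
    return sum(g / math.log2(1 + r) for r, g in enumerate(gains, start=1))

def normalise(dcg_fn, gains, **kwargs):
    """Divide by the DCG of the ideal (descending-gain) reordering."""
    ideal = sorted(gains, reverse=True)
    return dcg_fn(gains, **kwargs) / dcg_fn(ideal, **kwargs)

# Gain values follow (L2, L1, L0) = (2, 1, 0); both runs are hypothetical.
good_run = [2, 1, 0, 0, 0]  # relevant documents at the top
bad_run = [0, 0, 0, 1, 2]   # same documents, relevant ones at the bottom

orig_good = normalise(dcg_original, good_run, b=10)
orig_bad = normalise(dcg_original, bad_run, b=10)
ms_good = normalise(dcg_microsoft, good_run)
ms_bad = normalise(dcg_microsoft, bad_run)
```

With b = 10 the original nDCG gives both runs the same score, while the Microsoft version penalises the run that places the relevant documents lower.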


1.3.4 GeoTime (NTCIR-8 and -9) 


The NTCIR-8 GeoTime task (Gey et al. 2010), which dealt with adhoc IR given 
“when and where”-type topics, constructed test collections with the following graded
relevance levels: Fully relevant (the document answers both the “when” and “where” 
aspects of the topic), Partially relevant—where (the document only answers the 
“where” aspect of the topic), and Partially relevant—when (the document only 
answers the “when” aspect of the topic). The evaluation tools from the IR4QA task 
were used to compute (Microsoft) nDCG, Q, and AP, with a gain value of 2 for each 
fully relevant document and a gain value of 1 for each partially relevant one (regardless
of “when” or “where”) for the two graded relevance measures.¹⁰ The NTCIR-9
GeoTime task (Gey et al. 2011) used the same evaluation methodology as above. 


1.3.5 CQA (NTCIR-8) 


The NTCIR-8 Community Question Answering (CQA) task (Sakai et al. 2010) was 
an answer ranking task: given a question from Yahoo! Chiebukuro (Japanese Yahoo! 
Answers) and the answers posted in response to that question, rank the answers by 
answer quality. While the Best Answers (BAs) selected by the actual questioners 
were already available in the Chiebukuro data, additional graded relevance assess- 
ments were obtained offline to find Good Answers (GAs), by letting four assessors 
independently judge each posted answer. Each assessor labelled an answer as either 
A (high-quality), B (medium-quality), or C (low-quality), and hence 15 different 
label patterns were obtained: AAAA, AAAB,..., BCCC, CCCC. In the official 
evaluation at NTCIR-8, these patterns were mapped to 4-point relevance levels: for 
example, AAAA and AAAB were mapped to L3-relevant, and ACCC, BCCC and 
CCCC were mapped to L0. In a separate study, the same data were mapped to 9-point
relevance levels, by giving 2 points to an A and 1 point to a B and summing


⁹D♯-nDCG implemented in NTCIREVAL also builds on the Microsoft version of nDCG, not the
original nDCG.


¹⁰While the GeoTime overview paper suggests that the above relevance levels were mapped to
binary relevance, this was in fact not the case: see Sakai (2019).
up the scores for each pattern. Using the graded Good Answers data, three graded
relevance measures were computed: normalised gain at l = 1 (nG@1),¹¹ nDCG, and
Q. In addition, Hit at l = 1 was computed for both Best Answers and Good Answers
data: this is a binary relevance measure which only cares whether the top-ranked 
item is relevant or not. 
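
The label-pattern arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the two rank-1 measures assume that the ranked list contains every judged answer for the question, so that the best available gain can be read off the list itself.

```python
from itertools import combinations_with_replacement

# Four assessors, three labels, order ignored: AAAA, AAAB, ..., BCCC, CCCC.
patterns = ["".join(p) for p in combinations_with_replacement("ABC", 4)]

# Separate study: 2 points per A, 1 point per B, summed over the pattern.
points = {"A": 2, "B": 1, "C": 0}
pattern_score = {p: sum(points[label] for label in p) for p in patterns}
levels = sorted(set(pattern_score.values()))

def ng_at_1(ranked_gains):
    """Normalised gain at rank 1: top item's gain over the best available gain."""
    best = max(ranked_gains)
    return ranked_gains[0] / best if best > 0 else 0.0

def hit_at_1(ranked_gains):
    """Binary: 1 if the top-ranked item has any positive gain, else 0."""
    return 1 if ranked_gains[0] > 0 else 0
```

The enumeration yields the 15 patterns and the 9 distinct summed scores (0 through 8) mentioned in the text. For a ranked list of gains such as [1, 3, 0], Hit at rank 1 is satisfied while nG@1 is only one third, which shows what the graded measure adds.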


1.3.6 INTENT/IMine (NTCIR-9 Through -12)


The NTCIR-9 INTENT task overview paper (Song et al. 2011) was the first NTCIR 
overview to mention the use of the NTCIREVAL tool, which can compute various 
graded relevance measures for adhoc and diversified IR including Q, nDCG, and D♯-measures
(Sakai and Zeng 2019). D♯-nDCG and its components I-rec and D-nDCG
were used as the official evaluation measures. The Document Retrieval (DR) subtask
of the INTENT task had intentwise graded relevance assessments on a 5-point scale.
While the Subtopic Mining (SM) subtask of the INTENT task also used D♯-nDCG
to evaluate ranked lists of subtopic strings, no graded relevance assessments were 
involved in the SM subtask since each subtopic string either belongs to an intent (i.e., 
a cluster of subtopic strings) or not. Hence, the SM subtask may be considered to be 
outside the scope of the present survey; but see Sakai (2019) for a discussion. 

The NTCIR-10 INTENT task was basically the same as its predecessor, with 5-point
intentwise relevance levels for the DR subtask and D♯-nDCG as the primary
evaluation measure. However, as the intents came with informational/navigational 
tags, new measures called DIN-nDCG and P+Q (Sakai 2014) were used in addition 
to leverage this information. 

The NTCIR-11 IMine task (Liu et al. 2014) was similar to the INTENT tasks,
except that its SM subtask required participating systems to return a two-level hierarchy
of subtopic strings. The SM subtask was evaluated using the H-measure, which
combines (a) the accuracy of the hierarchy, (b) the D♯-nDCG score based on the
ranking of the first-level subtopics, and (c) the D♯-nDCG score based on the ranking
of the second-level subtopics. However, recall the above remark on the INTENT SM
subtask: intentwise graded relevance does not come into play in this subtask. On the
other hand, the IMine DR subtask was evaluated in a way similar to the INTENT DR
tasks, with D♯-nDCG computed based on 4-point relevance levels: highly relevant,
relevant, nonrelevant, and spam. The gain value setting used was: (2, 1, 0, 0).¹² The
IMine task also introduced the TaskMine subtask, which requires systems to rank 
strings that represent subtasks of a given task (e.g., “take diet pills” in response to 
“lose weight.”). This subtask involved graded relevance assessments. Each subtask 
string was judged independently by two assessors from the viewpoint of whether 


¹¹nG@1 is often referred to as nDCG@1; however, note that neither discounting nor cumulation is
applied at rank 1.

¹²Kindly confirmed by task organisers Yiqun Liu and Cheng Luo in a private email communication
(March 2019).
the subtask is effective for achieving the input task. A 4-point per-assessor relevance
scale was used,¹³ with weights (3, 2, 1, 0), and final relevance levels were given as
the average of the two scores, which means that a 6-point relevance scheme was 
adopted. The averages were used verbatim as gain values: (3.0, 2.5, 2.0, 1.5, 1.0, 0). 
The evaluation measure used was nDCG, but duplicates (i.e., multiple strings repre- 
senting the same subtask) were not rewarded. 

The Query Understanding (QU) subtask of the NTCIR-12 IMine-2 Task 
(Yamamoto et al. 2016), a successor of the previous SM subtasks of INTENT/IMine, 
required systems to return a ranked list of (subtopic, vertical) pairs (e.g., (“iPhone 6 
photo”, Image), (“iPhone 6 review”, Web)) for a given query. The official evaluation 
measure, called the QU-score, is a linear combination of D♯-nDCG (computed as
in the INTENT SM subtasks) and the V-score which measures the appropriateness 
of the named vertical for each subtopic string. Despite the binary relevance nature 
of the subtopic mining aspect of the QU subtask, it deserves to be discussed in 
the present survey because the V-score part relies on graded relevance assessments. 
To be more specific, the V-score relies on the probabilities {Pr(v|i)}, for intents 
{i} and verticals {v}, which are derived from 3-point scale relevance assessments: 
2 (highly relevant), 1 (relevant), and 0 (nonrelevant). Hence the QU-score may be 
regarded as a graded relevance measure. The Vertical Incorporating (VI) subtask of 
the NTCIR-12 IMine-2 Task (Yamamoto et al. 2016) also used a version of D#-nDCG 
to allow systems to embed verticals (e.g., Vertical-News, Vertical-Image) within a 
ranked list of document IDs for diversified search. More specifically, the organisers 
replaced the intentwise gain value g_i(r) at rank r in the global gain formula (Sakai 
2014) with Pr(v(r)|i) g_i(r), where v(r) is the vertical type (“Web,” Vertical-News, 
Vertical-Image, etc.) of the document at rank r, and the vertical probability given 
an intent is obtained from 3-point scale relevance assessments as described above. 
As for the intentwise gain value g_i(r), it was also on a 3-point scale for the Web 
documents: 2 for highly relevant, 1 for relevant, and 0 for nonrelevant documents. 
Moreover, if the document at r was a vertical, the gain value was set to 2. In addition, 
the VI subtask collected topicwise relevance assessments on a 4-point scale: highly 
relevant, relevant, nonrelevant, and spam. The gain values used were: (2, 1, 0, 0).¹⁴ 
As the subtask had a set of very clear, single-intent topics among their full topic set, 
Microsoft nDCG (rather than D#-nDCG) was used for these particular topics. 
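
The gain replacement used in the VI subtask can be illustrated with a minimal sketch. It assumes the usual D-measure form of the global gain, i.e., the sum over intents of Pr(i|q) g_i(r), and simply scales each intentwise gain by Pr(v(r)|i). The function name, the dictionaries, and all numbers are hypothetical, and the fixed gain of 2 for a vertical item is assumed here to apply to every intent.

```python
def global_gain(intent_probs, intent_gains, vertical_probs, vertical):
    """Global gain of the item at one rank, with vertical-aware intentwise gains.

    intent_probs:   {intent: Pr(i|q)}
    intent_gains:   {intent: g_i(r)} for the item at rank r
    vertical_probs: {intent: {vertical: Pr(v|i)}}
    vertical:       v(r), the vertical type of the item at rank r
    """
    return sum(
        p_i * vertical_probs[i].get(vertical, 0.0) * intent_gains.get(i, 0)
        for i, p_i in intent_probs.items()
    )


# Hypothetical topic with two intents and two vertical types.
intent_probs = {"i1": 0.6, "i2": 0.4}
vertical_probs = {
    "i1": {"Web": 0.5, "Vertical-Image": 0.5},
    "i2": {"Web": 1.0, "Vertical-Image": 0.0},
}

# A Web document that is highly relevant (2) to i1 and relevant (1) to i2.
web_doc_gain = global_gain(intent_probs, {"i1": 2, "i2": 1}, vertical_probs, "Web")

# A vertical item: its intentwise gain is set to 2 (assumed for every intent).
image_gain = global_gain(
    intent_probs, {"i1": 2, "i2": 2}, vertical_probs, "Vertical-Image"
)
```

In this toy setting the Web document outscores the image vertical because the second intent places no probability on images, which is the behaviour the replacement is meant to produce.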


1.3.7 RecipeSearch (NTCIR-11) 


While the official evaluation results of the Adhoc Recipe Search subtask of the NTCIR- 
11 RecipeSearch Task (Yasukawa et al. 2014) were based on binary relevance, the 


13While the overview (Sect. 4.3) says that a 3-point scale was used, this was in fact not the case: 
kindly confirmed by task organiser Takehiro Yamamoto in a private email communication (March 
2019). 

14Kindly confirmed by task organisers Yiqun Liu and Cheng Luo in a private email communication 
(March 2019). 
organisers also explored evaluation based on graded relevance: they obtained graded 
relevance assessments on a 3-point scale for a subset (111 topics) of the full test topic 
set (500 topics).¹⁵ Microsoft nDCG was used to leverage the above data with a linear 
gain value setting, along with the binary AP and RR. 


1.3.8 Temporalia (NTCIR-11 and -12) 


The Temporal Information Retrieval (TIR) subtask of the NTCIR-11 Temporalia Task 
collected relevance assessments on a 3-point scale. Each TIR topic contained a past 
question, recency question, future question, and an atemporal question; participating 
systems were required to produce a Search Engine Result Page (SERP) for each of 
the above four questions. This adhoc IR task used Precision and Microsoft nDCG as 
the official measures, and Q for reference. 

While the Temporally Diversified Retrieval (TDR) subtask of the NTCIR-12 
Temporalia-2 Task was similar to the above TIR subtask, it required systems to 
return a fifth SERP, which covers all of the above four temporal classes. That is, 
this fifth SERP is a diversified SERP, where the temporal classes can be regarded 
as different search intents for the same topic. The relevance assessment process fol- 
lowed the practice of the NTCIR-11 TIR task, and the SERPs for the four questions 
were evaluated using nDCG. As for the diversified SERPs, they were evaluated using 
α-nDCG (Clarke et al. 2008) and D#-nDCG. 

A linear gain value setting was used in both of the above subtasks.¹⁶ 


1.3.9 STC (NTCIR-12 Through -14) 


The NTCIR-12 Short Text Conversation (STC) task (Shang et al. 2016) was a 
response retrieval task given a tweet (or a Chinese Weibo post). For both Chinese 
and Japanese subtasks, the response tweets were first labelled on a binary scale, for 
each of the following criteria: Coherence, Topical Relevance, Context Independence, 
and Non-repetitiveness. The final graded relevance levels were determined using the 
following mapping scheme: 


if Coherent AND Topically Relevant 
    if Context-independent AND Non-repetitive 
        RelevanceLevel = L2 
    else 
        RelevanceLevel = L1 
else 


15While the overview paper says that a 4-point scale was used, this was in fact not the case: kindly 
confirmed by task organiser Michiko Yasukawa (March 2019) in a private email communication. 


16Kindly confirmed by task organiser Hideo Joho in a private email communication (March 2019). 
    RelevanceLevel = L0. 
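
The mapping scheme above translates directly into a small function. The following is only a sketch: the function and parameter names are hypothetical, and the levels L2, L1, and L0 are returned as the integers 2, 1, and 0.

```python
def stc_relevance_level(coherent, topically_relevant,
                        context_independent, non_repetitive):
    """Map the four binary labels of a response to a graded relevance level."""
    if coherent and topically_relevant:
        if context_independent and non_repetitive:
            return 2  # L2: passes all four criteria
        return 1      # L1: coherent and on topic, but fails one of the other two
    return 0          # L0: incoherent or off topic


# A coherent, on-topic response that only makes sense in context is L1.
example_level = stc_relevance_level(True, True, False, True)
```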

Following the quadratic gain value setting often used for web search evalua- 
tion (Burges et al. 2005) and for computing ERR (Chapelle et al. 2009), the Chinese 
subtask organisers mapped the L2, L1, and L0 relevance levels to the following 
gain values: 2^2 - 1 = 3, 2^1 - 1 = 1, 2^0 - 1 = 0; according to the present survey of 
NTCIR retrieval tasks, this is the only case where a quadratic gain value setting 
was used instead of the linear one. The evaluation measures used for this subtask 
were nG@1, P+, and normalised ERR (nERR). As for the Japanese subtask which 
used Japanese Twitter data, the same mapping scheme was applied, but the scores 
((L2, L1, L0) = (2, 1, 0)) from 10 assessors were averaged to determine the final 
gain values; a binary relevance, set-retrieval accuracy measure was used instead of 
P+, along with nG@1 and nERR. 
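
To make the Chinese subtask's gain setting concrete, the sketch below computes nG@1 and nERR for a toy ranking. It assumes the ERR stopping probability of Chapelle et al. (2009), (2^level - 1) / 2^max_level, and normalises by the ERR of an ideal ranking of the judged items truncated to the length of the run. The function names and the toy data are hypothetical, and the official evaluation tool may differ in details such as the cutoff.

```python
def exp_gain(level):
    """Gain value 2^level - 1, i.e., (L2, L1, L0) -> (3, 1, 0)."""
    return 2 ** level - 1


def err(levels, max_level):
    """Expected Reciprocal Rank for a ranked list of relevance levels."""
    score, p_continue = 0.0, 1.0
    for rank, level in enumerate(levels, start=1):
        p_stop = exp_gain(level) / 2 ** max_level
        score += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return score


def nerr(levels, all_levels, max_level):
    """ERR divided by the ERR of an ideal ranking of all judged items."""
    ideal = sorted(all_levels, reverse=True)[:len(levels)]
    ideal_score = err(ideal, max_level)
    return err(levels, max_level) / ideal_score if ideal_score > 0 else 0.0


def ng_at_1(levels, all_levels):
    """Gain of the top-ranked item divided by the best gain available."""
    best = exp_gain(max(all_levels))
    return exp_gain(levels[0]) / best if levels and best > 0 else 0.0


pool = [2, 1, 0, 0]   # levels of all judged responses for one post (toy data)
run = [1, 2, 0]       # levels of the responses a system returned, in rank order
run_nerr = nerr(run, pool, max_level=2)
run_ng1 = ng_at_1(run, pool)
```

Because the L2 response sits at rank 2 rather than rank 1, both measures fall below 1 for this run; nG@1 is the harsher of the two since it looks at rank 1 only.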

The NTCIR-13 STC task (Shang et al. 2017) was similar to its predecessor, 
although systems were allowed to generate responses instead of retrieving existing 
tweets. In the Chinese subtask, 7-point relevance levels were obtained by summing 
up the assessor scores, and a linear gain value setting was used to compute nG@1, 
P+, and nERR. In addition, an alternative approach to consolidating the assessor 
scores was explored, by considering the fact that some tweets receive unanimous 
ratings while others do not even if they are the same in terms of the sum of assessor 
scores (Sakai 2017). The NTCIR-13 STC Japanese subtask used Yahoo! News Com- 
ments data instead of Japanese Twitter data. The evaluation method was similar to 
what was used in the previous Japanese subtask; see Sakai (2019) for more details. 

Although the Chinese Emotional Conversation Generation (CECG) subtask of the 
NTCIR-14 STC task (Zhang and Huang 2019) is not exactly a ranked retrieval 
task, we discuss it here as it is a successor of the previous Chinese STC subtasks that 
utilises graded relevance measures. Given an input tweet and an emotional category 
such as Happiness or Sadness, participating systems for this subtask were required 
to return one generated response. A mapping scheme similar to that of the previous Chinese 
subtasks was used to form 3-point relevance levels. As for the evaluation measures, 
the relevance scores (L2, L1, L0) = (2, 1, 0) of the returned responses were simply 
summed or averaged across the test topics. 


1.3.10 WWW (NTCIR-13 and -14) and CENTRE (NTCIR-14) 


The NTCIR-13 We Want Web (WWW) Task (Luo et al. 2017) was an adhoc web 
search task. For the Chinese subtask, three assessors independently judged each 
pooled web page on a 4-point scale: (3, 2, 1, 0); the scores were then summed 
up to form the final 10-point relevance levels. For the English subtask, two assessors 
independently judged each pooled web page on a different 4-point scale: highly 
relevant (2 points), relevant (1 point), nonrelevant (0 points), and error (0 points); 
the scores were then summed up to form the final 5-point relevance levels. In both 
subtasks, linear gain value settings were used to compute (Microsoft) nDCG, Q (the 
cutoff version (Sakai 2014)), and nERR. 
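
As a rough illustration of how summed assessor scores feed into Microsoft nDCG with a linear gain value setting, consider the sketch below. The document IDs and judgements are hypothetical; the log2(r + 1) discount is the Microsoft variant of DCG (Burges et al. 2005), and the gain of a document is simply its summed relevance level.

```python
import math


def summed_level(assessor_scores):
    """Final relevance level: the sum of the per-assessor scores."""
    return sum(assessor_scores)


def dcg(gains):
    """Microsoft-style DCG: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))


def ndcg(run, qrels, cutoff):
    """nDCG at the given cutoff with linear gains (gain = relevance level)."""
    gains = [qrels.get(doc, 0) for doc in run[:cutoff]]
    ideal = sorted(qrels.values(), reverse=True)[:cutoff]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy Chinese-subtask judgements: three assessors, each on the (3, 2, 1, 0) scale,
# so the summed level ranges over 0..9, i.e., ten relevance levels.
judgements = {"d1": [3, 3, 2], "d2": [1, 0, 1], "d3": [0, 0, 0]}
qrels = {doc: summed_level(scores) for doc, scores in judgements.items()}

perfect = ndcg(["d1", "d2", "d3"], qrels, cutoff=10)
swapped = ndcg(["d2", "d1", "d3"], qrels, cutoff=10)
```

The same code covers the English subtask if each document carries two scores on the (2, 1, 0, 0) scale, which yields the five levels 0..4.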

The NTCIR-14 WWW Task (Mao et al. 2019) was similar to its predecessor. The 
Chinese subtask used the following judgment criteria: highly relevant (3 points), 
relevant (2 points), marginally relevant (1 point), nonrelevant (0 points), garbled (0 
points). Although three assessors judged each topic, the final relevance levels were 
obtained on a majority-vote basis rather than taking the sum; hence 4-point scale 
relevance levels were used this time. As for the English subtask, 5-point relevance 
levels were obtained by following the methodology of the NTCIR-13 English subtask. 
Both subtasks adhered to Microsoft nDCG, (cutoff-based) Q, and nERR with linear 
gain value settings. 

The NTCIR-14 CLEF NTCIR TREC Reproducibility (CENTRE) task (Sakai 
et al. 2019) encouraged participants to replicate a pair of runs from the NTCIR-13 
WWW English subtask and to reproduce a pair of runs from the TREC 2013 Web 
Track adhoc task (Collins-Thompson et al. 2014). Additional relevance assessments 
were conducted on top of the official NTCIR-13 WWW English test collection, by 
following the relevance assessment methodology of the WWW subtask. As for the 
evaluation of the TREC runs with the TREC 2013 Web Track adhoc test collection, the 
original 6-point scale relevance levels Navigational, Key, Highly relevant, Relevant, 
Nonrelevant, Junk were mapped to L4, L3, L2, L1, L0, L0, respectively. All runs 
involved in the CENTRE task were evaluated using Microsoft nDCG, (cutoff-based) 
Q, and nERR, with linear gain value settings. 


1.3.11 AKG (NTCIR-13) 


The NTCIR-13 Actionable Knowledge Graph (AKG) task (Blanco et al. 2017) had 
two subtasks: Action Mining (AM) and Actionable Knowledge Graph Generation 
(AKGG). Both of them involved graded relevance assessments and graded relevance 
measures. The AM subtask required systems to rank actions for a given entity type 
and an entity instance: for example, given “Product” and “Final Fantasy VIII,” the 
ranked actions could contain “play on Android,” “buy new weapons,” etc. Two sets 
of relevance assessments were collected by means of crowd sourcing: the first set 
judged the verb parts of the actions (“play,” “buy,” etc.) whereas the second set judged 
the entire actions (verb plus modifier as exemplified above). Both sets of judgements 
were done based on 4-point relevance levels. The AKGG subtask required partici- 
pants to rank entity properties: for example, given a quadruple (Query, Entity, Entity 
Types, Action) = (“request funding,” “funding,” “thing, action,” “request funding”), 
systems might return “Agent,” “ServiceType,” “Result,” etc. Relevance assessments 
were conducted by crowd workers on a 5-point scale. Both subtasks used nDCG and 
nERR for the evaluation; linear gain value settings were used.¹⁷ 
1.3.12 OpenLiveQ (NTCIR-13 and -14) 


The NTCIR-13 OpenLiveQ task (Kato et al. 2017) required participants to rank 
Yahoo! Chiebukuro questions for a given query, and the offline evaluation part of 
this task involved ranked list evaluation with graded relevance. Five crowd workers 
independently judged a list of questions for query q under the following instructions: 
“Suppose you input q and received a set of questions as shown below. Please select all 
the questions that you would want to click.” Thus, while the judgement is binary for 
each assessor, 6-point relevance levels were obtained based on the number of votes. 
(Microsoft) nDCG, Q, and ERR were computed using a linear gain value setting. 

The NTCIR-14 OpenLiveQ-2 task (Kato et al. 2019) is similar to its predeces- 
sor, but this time the evaluation involved unjudged documents, as the relevance 
assessments from NTCIR-13 were reused but the target questions to be ranked were 
not identical to the NTCIR-13 version. The organisers therefore used condensed- 
list (Sakai 2014) versions of Q, (Microsoft) nDCG, and ERR. Also, for OpenLiveQ- 
2, the organisers switched their primary measure from nDCG to Q, as Q substantially 
outperformed nDCG (at l = 5, 10, 20) in terms of correlation with online (i.e., click- 
based) evaluation in their experiments (Kato et al. 2018). 
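
The condensed-list idea mentioned above is simple to sketch: unjudged items are removed from the ranked list before any measure is computed, so that a system is neither rewarded nor penalised for returning them. The function name and the toy identifiers below are hypothetical.

```python
def condense(run, qrels):
    """Drop unjudged items from a ranked list, keeping the original order.

    Items judged nonrelevant (level 0) are judged, so they stay in the list.
    """
    return [doc for doc in run if doc in qrels]


# Toy example: q7 and q9 were never judged, so they are removed.
qrels = {"q2": 3, "q4": 0, "q5": 5}
run = ["q7", "q2", "q9", "q4"]
condensed = condense(run, qrels)
```

Any of the measures discussed in this chapter can then be computed on the condensed list exactly as it would be on the original one.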


1.4 Summary 


Table 1.1 summarises Sect. 1.2; Table 1.2 summarises Sect. 1.3. It can be observed 
that (a) the majority of the past NTCIR ranked retrieval tasks utilised graded relevance 
measures; and that (b) even a few relatively recent tasks, namely, SpokenQuery&Doc 
and MathIR from NTCIR-12 held in 2016, refrained from using graded relevance 
measures. As was discussed in Sect. 1.2.1, researchers should be aware that 
binary relevance measures with different relevance thresholds (e.g., Relaxed AP and 
Rigid AP) cannot serve as substitutes for good graded relevance measures. If the 
optimal ranked output for a task is defined as one that sorts all relevant documents in 
decreasing order of relevance levels, then by definition, graded relevance measures 
should be used to evaluate and optimise the retrieval systems. 
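
A small worked example may help to see why thresholded binary measures cannot stand in for a graded one. In the sketch below, two hypothetical runs differ only in the order of an L3 and an L2 document. Relaxed AP (threshold L1) and Rigid AP (threshold L2) both score the two runs identically, whereas nDCG with linear gains prefers the run that puts the L3 document first. The function names and the data are illustrative, and the lists are assumed to contain every relevant document for the topic.

```python
import math


def average_precision(levels, threshold):
    """Binary AP where an item is relevant if its level is at least threshold."""
    rel = [level >= threshold for level in levels]
    total = sum(rel)
    if total == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, is_rel in enumerate(rel, start=1):
        if is_rel:
            hits += 1
            score += hits / rank
    return score / total


def ndcg(levels):
    """Microsoft nDCG over the whole list with linear gains."""
    def dcg(xs):
        return sum(x / math.log2(r + 1) for r, x in enumerate(xs, start=1))
    ideal = dcg(sorted(levels, reverse=True))
    return dcg(levels) / ideal if ideal > 0 else 0.0


run_a = [3, 2, 1, 0]  # relevance levels in rank order: the ideal ordering
run_b = [2, 3, 1, 0]  # the L3 and L2 documents swapped

relaxed = (average_precision(run_a, 1), average_precision(run_b, 1))
rigid = (average_precision(run_a, 2), average_precision(run_b, 2))
graded = (ndcg(run_a), ndcg(run_b))
```

Neither threshold can see the swap, because both documents fall on the same side of each threshold; only the graded measure separates the two runs.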

One additional remark regarding Tables 1.1 and 1.2 is that the NTCIR-5 CLIR 
overview paper (Kishida et al. 2007) was the last to report on RP curves; the RP 
curves completely disappeared from the NTCIR overviews after that. This may be 
because (a) interpolated precisions at different recall points (Sakai 2014) do not 
directly reflect user experience; and (b) graded relevance measures have become 
more popular than before. 

Over the past decade or so, some researchers have pointed out a few disadvan- 
tages of using graded relevance, especially in the context of promoting preference 
judgements (e.g., Bashir et al. 2013; Carterette et al. 2008). Carterette et al. (2008) 
argue that (i) it is difficult to determine relevance grades in advance and to anticipate 
how the decision will affect evaluation; and (ii) having more grades means more 
Table 1.1 NTCIR ranked retrieval tasks with graded relevance assessments and binary relevance 
measures. Note that the relevance levels for the Patent Retrieval tasks of NTCIR-4 to -6 exclude the 
“nonrelevant” level: the actual labels are shown here because they are not simply different degrees 
of relevance (See Sect. 1.2.2) 

Task or subtask                   | NTCIR round (year) | Relevance levels       | Main evaluation measures discussed in overview 
----------------------------------|--------------------|------------------------|----------------------------------------------- 
Japanese and JEIR                 | 1 (1999)           | 3                      | AP, R-precision, Precision, RP curves 
JEIR                              | 2 (2001)           | 4                      | AP, R-precision, Precision, Interpolated Precision, RP curves 
Chinese and CEIR                  | 2                  | 4 per assessor         | RP curves 
CLIR                              | 3-5 (2002-2005)    | 4                      | AP, RP curves 
Patent retrieval                  | 3 (2002)           | 4                      | RP curves 
Patent retrieval                  | 4 (2004)           | A,B                    | AP, RP curves 
Patent retrieval                  | 5 (2005)           | A,B                    | CRS (for passage retrieval), AP 
Patent retrieval                  | 6 (2007)           | A,B/H,A,B (Japanese)   | AP 
                                  |                    | A,B (English)          | AP 
Spoken document/content retrieval | 9-11 (2011-2014)   | 3                      | AP and passage-level variants 
SQ-SCR (SGS)                      | 12 (2016)          | 3                      | AP 
Math retrieval                    | 10 (2013)          | 5 mapped to 3          | AP, Precision 
Math retrieval                    | 11 (2014)          | 3                      | AP, Precision, Bpref 
MathIR                            | 12 (2016)          | 3                      | Precision 


burden on the users. Regarding (i), while it is important to always check how our 
use of grades affects the evaluation outcome, in many cases relevance grades can 
be naturally defined based on individual assessors’ labels; I argue that it is useful to 
preserve the raw judgements in the form of graded relevance rather than to collapse 
them to binary; see also the discussion below on label distributions. Regarding (ii), 
rich relevance grades can be obtained even if the individual judgements are binary 
or ternary, as I have illustrated in this chapter. Moreover, while I agree that simple 
side-by-side preference judgements are useful (and can even be used for construct- 
ing graded relevance data), it should be pointed out that some of the approaches 
in the preference judgements domain require more complex judgement protocols 
than this, e.g., graded preference judgements (Carterette et al. 2008), and contextual 
preference judgements (Chandar and Carterette 2013; Golbus et al. 2014). Moreover, 
while I agree that utilising preference judgements is a promising avenue for future 
research, the incompleteness problem of preference judgements needs to be solved. 

What lies beyond graded relevance then? Here is my personal view concerning 
offline evaluation (as opposed to online evaluation using click data etc.). Information 
Table 1.2 NTCIR ranked retrieval tasks with graded relevance assessments and graded relevance 
measures. Binary relevance measures are shown in parentheses 

Task or subtask   | NTCIR round (year) | Relevance levels                                     | Main evaluation measures discussed in overview 
------------------|--------------------|------------------------------------------------------|----------------------------------------------- 
Web retrieval     | 3 (2003)           | 4 + best documents                                   | DCG ((W)RR, AP, RP curves) 
WEB informational | 4 (2004)           | 4                                                    | DCG ((W)RR, Precision, RP curves) 
WEB navigational  |                    |                                                      | DCG, ((W)RR, UCS) 
WEB navigational  | 5 (2005)           |                                                      | DCG, ((W)RR) 
CLIR              | 6 (2007)           |                                                      | nDCG, Q, generalised AP (AP) 
IR4QA             | 7-8 (2008-2010)    | 3                                                    | nDCG, Q (AP) 
GeoTime           | 8-9 (2010-2011)    | 3ᵃ                                                   | nDCG, Q (AP) 
CQA               | 8 (2010)           | 4(9) + best answers                                  | GA-{nG@1, nDCG, Q}, (GA-Hit@1, BA-Hit@1) etc. 
INTENT DR         | 9 (2011)           | 5                                                    | D#-nDCG 
INTENT DR         | 10 (2013)          | 5                                                    | D#-nDCG, DIN-nDCG, P+Q 
IMine DR          | 11 (2014)          | 4 incl. Spam                                         | D#-nDCG 
IMine TaskMine    | 11                 | 6                                                    | nDCG 
IMine QU          | 12 (2016)          | 3 (vertical)                                         | QU-score 
IMine VI          | 12                 | 3 (vertical), 3 (intentwise), 3 + Spam (topicwise)   | D#-nDCG, nDCG 
RecipeSearch      | 11 (2014)          | 3(2)                                                 | nDCG (AP, RR) 
Temporalia TIR    | 11                 | 3                                                    | nDCG, Q, (Precision) 
Temporalia TDR    | 12 (2016)          | 3                                                    | nDCG, α-nDCG, D#-nDCG 
STC Chinese       | 12                 | 3                                                    | nG@1, P+, nERR 
STC Chinese       | 13 (2017)          | 7(10)                                                | nG@1, P+, nERR 
STC Japanese      | 12-13 (2016-2017)  | 3 per assessor                                       | nG@1, nERR (Accuracy) 
STC CECG          | 14 (2019)          | 3                                                    | Sum/average of relevance scores 
WWW English       | 13-14 (2017-2019)  | 5                                                    | nDCG, Q, nERR 
WWW Chinese       | 13 (2017)          | 10                                                   | nDCG, Q, nERR 
WWW Chinese       | 14                 | 4                                                    | nDCG, Q, nERR 
AKG               | 13 (2017)          | 4 (AM) / 5 (AKGG)                                    | nDCG, nERR 
OpenLiveQ         | 13-14 (2017-2019)  | 6                                                    | nDCG, Q, ERR (with condensed lists at NTCIR-14) 
CENTRE            | 14 (2019)          | 5                                                    | nDCG, Q, nERR 

ᵃ Two types of partially relevant (when and where) counted as one level 
Retrieval (IR) and Information Access (IA) tasks have diversified, and relevance 
assessments require more subjective and diverse views than before. We are no longer 
just talking about whether a scientific article is relevant to the researcher’s question (as 
in Cranfield); we are also talking about whether a response of a chatbot is a “relevant” 
response to the user’s utterance, about whether a reply to a post on social media is 
“relevant,” and so on. Graded relevance implies that there should be a single label for 
each item to be retrieved (e.g., “this document is highly relevant”), but these new tasks 
may require a distribution of labels reflecting different users’ points of view. Hence, 
instead of collapsing this distribution to form a single label, methods to preserve 
the distribution of labels in the test collection may be useful, as was implemented 
at the Dialogue Breakdown Detection Challenge (Higashinaka et al. 2017). The 
Dialogue Quality (DQ) and Nugget Detection (ND) subtasks of the NTCIR-14 STC 
task were the very first NTCIR efforts in that direction: they compared gold label 
distributions with systems’ estimated distributions (Sakai 2018; Zeng et al. 2019). 
See also Maddalena et al. (2017) for a related idea. 
Chapter 2 
Experiments on Cross-Language 
Information Retrieval Using Comparable 
Corpora of Chinese, Japanese, 
and Korean Languages 


Kazuaki Kishida and Kuang-hua Chen 


Abstract This paper describes research activities for exploring techniques of cross- 
language information retrieval (CLIR) during the NACSIS Test Collection for Infor- 
mation Retrieval/NII Testbeds and Community for Information access Research 
(NTCIR)-1 to NTCIR-6 evaluation cycles, which mainly focused on Chinese, 
Japanese, and Korean (CJK) languages. First, general procedures and techniques 
of CLIR are briefly reviewed. Second, document collections that were used for the 
research tasks and test collection construction for retrieval experiments are explained. 
Specifically, CLIR tasks from NTCIR-3 to NTCIR-6 utilized multilingual corpora 
consisting of newspaper articles that were published in Taiwan, Japan, and Korea dur- 
ing the same time periods. A set of articles can be considered a “pseudo” comparable 
corpus because many events or affairs are commonly covered across languages in the 
articles. Such comparable corpora are helpful for comparing the performance of CLIR 
between pairs of CJK and English. This comparison leads to deeper insights into 
CLIR techniques. NTCIR CLIR tasks have been built on the basis of test collections 
that incorporate such comparable corpora. We summarize the technical advances 
observed in these CLIR tasks at the end of the paper. 


2.1 Introduction 


A “comparable corpus” can be defined as multiple sets of documents, each in dif- 
ferent languages, which approximately describe the same things or events. Unlike a 
parallel corpus, explicit alignments of words, sentences, paragraphs, or documents 
are not necessarily contained in the comparable corpus. In this sense, pairs of scien- 
tific abstracts written in Japanese and English that were used for retrieval experiments 
during the first and second NACSIS Test Collection for Information Retrieval/NII 
Testbeds and Community for Information access Research (NTCIR) evaluation 
cycles (i.e., NTCIR-1 and -2) as test documents can be considered document-linked 
comparable corpora. 

Scholarly journals or conference proceedings published in Japan often ask authors 
to attach an English title and abstract to their Japanese paper to promote scientific 
communication. Such a set of Japanese and English titles and abstracts is a parallel 
corpus, in which explicit alignments of titles or abstracts may be included if the 
authors attempted to write the English title or abstract such that they were equivalent 
to those in Japanese. Even though not all authors necessarily did so, the set 
can at least be regarded as a comparable corpus. 

In NTCIR-1, a corpus of such titles and abstracts was used for experiments of 
cross-language information retrieval (CLIR) in which English (E) documents were 
searched for Japanese (J) queries (i.e., a J to E bilingual search). Note that even if 
only a monolingual corpus in English is available, J to E bilingual searching can be 
tested by creating Japanese queries as search topics. However, Japanese and English 
comparable (or parallel) corpora allow us to compare results of J to E and E to J 
searching in a controlled setting, as the two target document sets in Japanese and 
English are topically similar. This type of comparison would play an important role 
in developing more sophisticated CLIR techniques. Actually, in NTCIR-2, a research 
task of E to J searching was added. 

This policy of designing CLIR experiments based on comparable corpora was 
maintained for NTCIR-3 to -6, in which CLIR between Chinese (C), Japanese (J), 
Korean (K), and English (E) was explored as one of the research tasks. More specif- 
ically, as target documents, NTCIR CLIR tasks used newspaper articles published 
in Taiwan, Japan, and Korea during the same time periods, which can be considered 
to be topically sufficiently comparable because they include many descriptions of 
common events and affairs occurring globally or locally in regions of East Asia. 
Actually, a comparison between pairs of CJKE languages based on such document 
sets largely contributed to the development of CLIR techniques between the CJKE 
languages even though the sets of the CJKE newspaper articles were “more loosely” 
comparable corpora than the sets of Japanese and English titles and abstracts in 
NTCIR-1 and -2. 

This paper mainly describes research efforts of CLIR tasks from NTCIR-3 to -6. 
Specifically, construction of test collections based on so-called “pseudo comparable 
corpora” (i.e., time- and region-aligned newspaper article sets) and CLIR techniques 
that were explored by research groups participating in the NTCIR CLIR tasks are the 
focus. In addition, CLIR experiments in NTCIR-1 and -2 are briefly mentioned before 
reviewing the NTCIR CLIR tasks. The NTCIR-3 CLIR task started in September 
2001 and the NTCIR-6 CLIR task ended in May 2007. Therefore, readers can 
understand the technical development of CLIR among CJKE during the time period 
from a historical perspective. 
2.2 Outline of Cross-Language Information Retrieval (CLIR) 


Before describing research efforts of NTCIR CLIR tasks, this section gives a con- 
cise, general overview of CLIR operations. Grefenstette (1998), Oard and Diekema 
(1998), Nie (2010), and Peters et al. (2012) provide more in-depth coverage of CLIR. 
Note that this section is based on a review article (Kishida 2005), which includes an 
exhaustive reference list on CLIR techniques that this section describes. 


2.2.1 CLIR Types and Techniques 


Some form of CLIR is required when a search query and target documents are written 
in different languages. If only a single language is used in documents then the task 
is termed bilingual information retrieval (BLIR). An example is J to E searching, in 
which only English documents are involved. In the case of multilingual information 
retrieval (MLIR), the target set consists of documents in two or more languages. In the 
NTCIR CLIR tasks, the most difficult challenge was to search a set of documents in 
four languages (CJKE). Note that if a query is written in C then standard monolingual 
information retrieval (i.e., C to C searching) may be included as a part of MLIR on the 
CJKE documents. Monolingual IR was specifically referred to as single language IR 
(SLIR) in the NTCIR CLIR tasks. Therefore, NTCIR CLIR tasks had three subtasks: 
SLIR, BLIR and MLIR. 

Generally, research efforts of CLIR can be traced back to a work by Gerald Salton 
in 1970 (Kishida 2005). Many researchers had attempted to develop CLIR techniques, 
particularly since the 1990s following popularization of the Internet. At that time, 
the main research task was to explore cross-lingual techniques for conventional ad 
hoc IR, which was also focused on by NTCIR CLIR tasks. However, it is possible 
to apply cross-lingual techniques to other applications related to ad hoc IR. 

An important operation for CLIR is to translate a query and/or individual documents. 
If the query is perfectly translated into the language of the target documents via 
machine translation (MT) software then CLIR reduces to ordinary monolingual 
IR. However, the translation is often imperfect because the queries are generally 
short and ambiguous (Oard and Diekema 1998). For example, when a query 
including only two single words “mercury earth” is entered into a search engine, the 
“mercury” in the source language has to be correctly translated into an equivalent 
that corresponds to a planet in the target language, not the chemical substance, in 
most cases. Sense disambiguation is often difficult because the queries may not con- 
tain sufficient contextual information for determining the correct meaning of each 
query term. To maintain the accuracy of the translation, it may be better to translate 
documents that are typically longer than the queries although document translation 
is more time-consuming in comparison to query translation. Another difficulty with 
document translation is that index files of the IR system increase in size because 
translations have to be added as index terms. 

Therefore, CLIR techniques typically consist of two main modules: (1) translation 
and (2) monolingual IR. Their effectiveness has an influence on the overall CLIR 
performance in the case that translation and monolingual searching independently 
work to generate a final search output, which would be a typical architecture of 
CLIR systems. However, there are IR models that integrate both modules more 
tightly. For example, language modeling (LM) can elegantly implement a 
CLIR operation by combining two conditional probabilities p(s|t) and p(t|d) during 
a process of computing document scores for ranked output where s and t denote a 
query term and a term in document d, respectively (Xu et al. 2001). In particular, 
p(s|t) is termed the translation probability. 
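
One way to read the combination of p(s|t) and p(t|d) is as a sum over document terms, p(s|d) = sum over t of p(s|t) p(t|d), multiplied over the query terms. The sketch below follows that reading under simplifying assumptions: p(t|d) is a plain relative frequency, the smoothing with a background collection used in practice is replaced by a small hypothetical floor, and the romanised query terms, the probability table, and the documents are all toy data.

```python
import math
from collections import Counter


def clir_log_score(query_terms, doc_terms, trans_prob, floor=1e-9):
    """log p(query | document) = sum over s of log( sum over t of p(s|t) p(t|d) ).

    trans_prob maps (s, t) pairs to the translation probability p(s|t).
    The floor only avoids log(0); it is not a principled smoothing method.
    """
    counts = Counter(doc_terms)
    length = len(doc_terms)
    log_score = 0.0
    for s in query_terms:
        p_s_given_d = sum(
            trans_prob.get((s, t), 0.0) * c / length for t, c in counts.items()
        )
        log_score += math.log(max(p_s_given_d, floor))
    return log_score


# Toy J-to-E setting: a romanised Japanese query against two English documents.
trans_prob = {("suisei", "mercury"): 0.5, ("chikyuu", "earth"): 0.9}
query = ["suisei", "chikyuu"]
doc_planet = ["mercury", "orbits", "near", "earth"]
doc_metal = ["mercury", "poisoning", "symptoms"]

score_planet = clir_log_score(query, doc_planet, trans_prob)
score_metal = clir_log_score(query, doc_metal, trans_prob)
```

Here translation and retrieval are not separate steps: the document about the planet is ranked higher simply because it supports translations of both query terms.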


2.2.2 Word Sense Disambiguation for CLIR 


As previously exemplified by an instance of “mercury,” word sense disambiguation 
(WSD) is important in CLIR. Typical methods for WSD in CLIR utilize (1) part-of- 
speech (POS) tags, (2) term co-occurrence statistics in the target document set, and 
(3) pseudo relevance feedback (PRF) techniques. 

When POS tags are used, target terms having the same POS tags as the source 
term are selected from a set of candidates as final query terms. The candidate target 
terms can be easily obtained from a machine-readable bilingual dictionary. 

In the case of utilizing term co-occurrence statistics in the target document set, 
the operation is more complicated. It is assumed that two translations t1 and t2 are 
extracted from a bilingual dictionary for a query term and that other translations u1 
and u2 are similarly obtained for another term in the same query. If t1 and u1 are 
semantically correct translations in the context of the given query then it is expected 
that t1 and u1 co-occur more frequently in the target corpus than a pair of t1 and u2 
and that of t2 and u1. Therefore, the co-occurrence frequencies aid in selecting final 
query terms in the target language, which is a basic assumption of the disambiguation 
method. When a large number of terms are included in an original source query, too 
many translations may be extracted from the dictionary. Because selection of final 
query terms is computationally expensive in such cases, some special techniques for 
solving the problem have been explored thus far (Kishida 2005). 
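
The basic assumption can be sketched as an exhaustive search over combinations of candidate translations, keeping the combination whose members co-occur most often in the target corpus. This is only an illustration: the function name and the frequency table are hypothetical, and the enumeration grows exponentially with the number of query terms, which is exactly the cost problem noted above.

```python
from itertools import product


def pick_translations(candidates, cooccurrence):
    """Choose one translation per query term so that pairwise co-occurrence is maximal.

    candidates:   list of candidate-translation lists, one list per source query term
    cooccurrence: {frozenset({a, b}): co-occurrence frequency in the target corpus}
    """
    def total(combo):
        return sum(
            cooccurrence.get(frozenset((a, b)), 0)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
    return max(product(*candidates), key=total)


# Toy frequencies for the t1/t2 and u1/u2 example in the text.
cooccurrence = {
    frozenset(("t1", "u1")): 40,
    frozenset(("t1", "u2")): 3,
    frozenset(("t2", "u1")): 5,
    frozenset(("t2", "u2")): 1,
}
selected = pick_translations([["t1", "t2"], ["u1", "u2"]], cooccurrence)
```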

Whereas the co-occurrence frequencies have to be computed before actual search- 
ing, such a type of preparatory work is not required for applying disambiguation 
techniques based on PRF. Instead, the searching operation is repeated during the 
process, which may be time-consuming in a real situation. That is, first, the target 
document collection is searched for a set of all translations that were obtained from 
a dictionary, and thereafter, final query terms are selected from the set of top-ranked 
documents (e.g., from the top 30 documents). Searching for the selected query terms 
is again repeated to obtain a final result. 
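
The first pass and the selection step can be sketched as follows. This is a toy stand-in, not a real retrieval model: documents are ranked by the number of distinct candidate translations they contain, and candidates are then chosen by how many top-ranked documents contain them, whereas an actual system would use a proper ranking function and a standard PRF term-weighting formula. All names and data are hypothetical, and the second search with the selected terms is left out.

```python
from collections import Counter


def prf_select(candidates, docs, top_k=30, n_terms=2):
    """Search with all candidate translations, then keep the candidates that
    occur in the largest number of top-ranked documents."""
    cand = set(candidates)
    # First pass: toy ranking by the number of distinct candidates matched.
    ranked = sorted(docs, key=lambda d: len(cand & set(d)), reverse=True)[:top_k]
    # Selection: document frequency of each candidate within the top-ranked set.
    df = Counter(t for d in ranked for t in cand & set(d))
    return [t for t, _ in df.most_common(n_terms)]


# Toy target collection; each document is a list of terms.
docs = [
    ["planet", "earth", "orbit"],
    ["planet", "earth", "sun"],
    ["quicksilver", "thermometer"],
    ["soil", "garden"],
    ["planet", "earth"],
]
# All dictionary translations of a two-term source query.
candidates = ["planet", "quicksilver", "earth", "soil"]
final_terms = prf_select(candidates, docs, top_k=3, n_terms=2)
```

Documents that contain a coherent pair of translations rise to the top in the first pass, so the unrelated senses drop out of the final query.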
Originally, PRF attempts to expand a given query by adding some “significant” 
terms from the top-ranked documents to the query under the assumption that they 
also indicate an information need represented by the original query. The newly added 
terms may mainly contribute to enhancing the recall ratio. In the context of disam- 
biguation for CLIR, it is expected that documents including a semantically correct 
combination of target terms (e.g., t1 and u1 in the aforementioned example) are at a 
higher position in a ranked list of the first search (Ballesteros and Croft 1997). As a 
result, terms that co-occur in the top-ranked documents tend to be selected, which has 
the same effect as using term co-occurrence statistics. Thus, the selection based on 
the top-ranked documents works incidentally as a system for disambiguation. Note 
that the co-occurrences in the top-ranked documents are limited to a local context 
of the original query, unlike term co-occurrence statistics in the entire document set. 
Final query terms are typically selected according to term weights that are calculated 
using a formula of standard PRF techniques (Kishida 2005). 


2.2.3 Language Resources for CLIR 


As mentioned previously, a typical language resource for implementing CLIR is a 
machine-readable bilingual dictionary or MT software. When neither the dictionary 
nor the software is available for a given pair of source and target languages, it is 
possible to apply a pivot language approach. For example, even if a resource between 
Japanese and Swedish (S) is not found, J to English and English to S resources allow 
us to execute J to S bilingual searching, where English is a pivot language. More 
specifically, by translating each Japanese query term into English equivalents and 
converting them again to Swedish terms, a final Swedish query can be obtained. Thus, 
the resulting Swedish query can be used for retrieval of the Swedish documents. 
Because English is an international language, many language resources related to 
English are actually available. 
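
The pivot language approach can be sketched as the composition of two dictionary lookups. The following is only an illustration: the function `translate_via_pivot` and the two tiny dictionaries are hypothetical, and a real system would also have to deal with the ambiguity that accumulates over the two translation steps.

```python
def translate_via_pivot(source_terms, source_to_pivot, pivot_to_target):
    """Translate terms through a pivot language using two bilingual dictionaries."""
    target_terms = []
    for term in source_terms:
        # Terms missing from the first dictionary simply yield nothing here.
        for pivot_term in source_to_pivot.get(term, []):
            for target_term in pivot_to_target.get(pivot_term, []):
                if target_term not in target_terms:
                    target_terms.append(target_term)
    return target_terms


# Hypothetical toy dictionaries: Japanese to English and English to Swedish.
JA_EN = {"図書館": ["library"], "犬": ["dog"]}
EN_SV = {"library": ["bibliotek"], "dog": ["hund"]}

print(translate_via_pivot(["図書館", "犬"], JA_EN, EN_SV))
```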

In addition, parallel corpora play an important role in CLIR. Without a dictionary 
or MT software, CLIR can be executed by searching a parallel corpus for a query 
written in the source language. That is, because textual data that were found via 
searching have another part in the target language, it is possible to extract final query 
terms in the target language from the data. Additionally, a parallel corpus consisting 
of sentence alignments can be used for estimating translation probabilities, in which 
the well-known IBM Model 1 for statistical MT has often been applied. The list of 
the translation probabilities works as a bilingual dictionary, and is indispensable for 
LM-based CLIR (see Sect. 2.2.1). 
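
As an illustration of how translation probabilities can be estimated from sentence alignments, the following is a minimal sketch of the expectation-maximization procedure of IBM Model 1. It is simplified under stated assumptions: the NULL source word is omitted, the three-sentence corpus is a toy example, and the function name `train_ibm_model1` is hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate t(target | source) by EM from (source_tokens, target_tokens) pairs.

    Simplification: no NULL source word is added to the source sentences.
    """
    target_vocab = {t for _, tgt in sentence_pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source) links
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words of its pair.
        for src, tgt in sentence_pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    fraction = prob[(t, s)] / norm
                    count[(t, s)] += fraction
                    total[s] += fraction
        # M-step: renormalize the expected counts into probabilities.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return dict(prob)


# Toy sentence-aligned corpus (source: German, target: English).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm_model1(corpus)
for (target, source), p in sorted(table.items(), key=lambda kv: -kv[1])[:4]:
    print(f"t({target} | {source}) = {p:.3f}")
```

The resulting table plays the role of a probabilistic bilingual dictionary: for each source word, the probabilities over its co-occurring target words sum to one.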

Of course, standard language processing tools such as a POS tagger (or a mor- 
phological analyzer), a stemmer, and a named entity recognizer are also employed 
in CLIR. 
2.3 Test Collections for CLIR from NTCIR-1 to NTCIR-6 


A main contribution of the NTCIR CLIR tasks is to examine whether or not the 
CLIR techniques that were reviewed in the previous section can be applied to the 
CJK languages and to enhance the techniques by tailoring them to situations in which 
CJK languages are used. When the NTCIR CLIR task started, the Chinese language 
had been already explored in the Text REtrieval Conference (TREC) (Voorhees and 
Harman 2005). In contrast, a systematic and large-scale CLIR experiment related 
to the Japanese and Korean languages would be considered as an original NTCIR 
contribution. This section provides a simple overview of test collections on which 
various trial-and-error attempts were made in NTCIR from the very beginning. 


2.3.1 Japanese-English Comparable Corpora in NTCIR-1 
and NTCIR-2 


As previously mentioned, a set of Japanese and English titles and abstracts in confer- 
ence proceedings that were published by Japanese academic societies was a source 
of documents in NTCIR-1. More specifically, in total, 339,483 bibliographic records 
of conference papers were collected. Because the set included three types of records 
having (1) only Japanese abstracts, (2) only English abstracts, and (3) both Japanese 
and English abstracts, the set of Japanese documents (J collection) and the set of 
English documents (E collection) were constructed as a subset of the whole set (JE 
collection). Research groups participating in NTCIR-1 were able to use the three sets 
during IR experiments (Kando et al. 1999). 

All search requests for the experiments (i.e., search topics) were written in 
Japanese (30 topics for training and 53 topics for evaluation). Therefore, it was 
possible for the participants to examine only J to E bilingual searching as CLIR 
experiments. The NTCIR-1 conference was held in September of 1999, which would
have been the first opportunity for international discussion of CLIR issues related
to the Japanese language. 

In NTCIR-2, by adding bibliographic records of some scientific reports published 
in Japan, the document sets were substantially extended. The English and Japanese 
versions of 49 search topics were prepared by the task organizers (Kando et al. 2001). 
The test collection allowed the participants to experiment in E to J and J to E bilingual 
searching and in J to JE and E to JE multilingual searching. 
2.3.2 Chinese-Japanese-Korean (CJK) Corpora from 
NTCIR-3 to NTCIR-6 


Based on the knowledge that was obtained by the efforts of NTCIR-1 and -2, more 
sophisticated CLIR experiments involving C, J, K, and E were started as an inde- 
pendent task beginning with NTCIR-3. In the CLIR tasks from NTCIR-3 to -6, 
newspaper articles that were collected from various news agencies in East Asia were 
employed as target documents. Each record of the articles included its headline and 
full text. Table 2.1 summarizes the document sets; the number of documents is indi- 
cated in Table 2.2. Note that the CLIR task of NTCIR-6 had two stages (i.e., stages 1 
and 2). The purpose of stage 2 was to obtain a more reliable measurement of search 
performance. Newspaper articles published in 1998 and 1999 were basically used 
for experiments in NTCIR-3 and -4 whereas newspaper articles for NTCIR-5 and 
stage 1 in NTCIR-6 were from 2000 and 2001. 

For some reason, only the Korean document set in NTCIR-3 consisted of newspaper
articles from 1994. However, from NTCIR-4, newspaper articles from matching time
periods (i.e., 1998-99 and 2000-01) were provided as CJKE document sets for
experiments (English documents were out of scope in NTCIR-6). As previously discussed, 
the sets can be considered as types of comparable corpora because the newspaper 
articles in the sets were commonly concerned with worldwide or East Asian events 
and affairs of the time, allowing a CLIR performance comparison between the pairs of 
CJKE languages partly because documents in the individual languages are topically 
homogeneous to some extent. Notably, the Chinese documents were represented by 
only traditional Chinese characters, not simplified ones. 

A newspaper article is typically written for general audiences; its text is relatively 
plain and shorter in comparison to that of scientific or technical papers. There is no 
explicit structure in the text of newspaper articles except for a headline and para- 
graphs, which is different from XML documents having a more complex structure. 
Additionally, newspaper article records in NTCIR CLIR tasks did not include any 


Table 2.1 Document sets used by NTCIR CLIR tasksᵃ

Task        Period of tasks   Date of newspaper articles           Set for MLIR
NTCIR-3     2001-02           C, J, E: 1998-99; K: 1994            CJ, CE, JE, CJE
NTCIR-4     2003-04           C, J, K, E: 1998-99                  CJE, CJKE
NTCIR-5     2004-05           C, J, K, E: 2000-01                  CJKE
NTCIR-6     2006-07
  Stage 1                     C, J, K: 2000-01                     CJK
  Stage 2                     NTCIR-3, -4, -5 test collectionsᵇ

ᵃ Search topics in C, J, K, and E were created for the document sets
ᵇ In stage 2 of NTCIR-6, a cross-collection analysis was attempted

Table 2.2 Number of records in document sets in the NTCIR-3 to -6 CLIR tasks

Period    Language   No. of records                      Usage (denoted by the mark x)
1994      Korean     Korea Economic Daily: 66,146        x x
1998-99   Chinese    UDNᵃ + others: 381,375              x x x
          Japanese   Mainichi: 220,078                   x x x
                     Yomiuri: 373,558                    x x
          Korean     Hankookilbo + Chosunilbo: 254,438   x x
          English    Mainichi Daily + EIRBᵇ: 22,927      x x
                     Xinhua + others: 324,449            x
2000-01   Chinese    UDNᵃ + others: 901,446              x x x
          Japanese   Mainichi + Yomiuri: 858,400         x x x
          Korean     Hankookilbo + Chosunilbo: 220,374   x x x
          English    Xinhua + others: 259,050            x

ᵃ UDN: United Daily News
ᵇ EIRB: Taiwan News and China Times English News


topic keywords such as descriptors that are often assigned to bibliographic records of 
scientific papers. Today various types of documents are exploited for current research 
on IR or related areas, but the test collections using such newspaper articles still pro- 
vide IR researchers a sound experimental setting for examination of fundamental 
techniques that underlie more complicated searches. 


2.3.3 CJKE Test Collection Construction 


Test collections incorporating the CJKE documents were constructed according to a 
traditional pooling method explored by TREC. In general, a test collection consists 
of three components: a document set, topic set (set of search requests), and answer set 
(result of relevance judgments). By employing the answer set, metrics for evaluating 
IR results such as precision or recall can be computed. When calculating the recall, it 
is required to determine all relevant documents included in the document set, which 
is typically impossible for large-scale document sets. Therefore, the pooling method 
was developed for using such large sets. Figure 2.1 shows an operational model for 
IR evaluation based on the pooling method. 

First, a document set such as that shown in Table 2.1 is sent to participants in the
tasks so that they can load it into their own IR systems. Then, task organizers deliver
a topic set to participants and ask them to submit search results by the designated day. 
Fig. 2.1 Construction of test collection and evaluation 


Under management of the task organizers, the degree of relevance is judged for each 
pair of a topic and a document included in the search results, by which an answer set 
is obtained. Finally, the search performance of the participating IR system is scored 
based on the answer set. By checking the scores, the advantages or disadvantages of 
IR theories or techniques are clarified. Because the relevance judgment is completed 
for pooled documents that are extracted from the search results that participants 
submitted, and not for the entire set of documents, this procedure for creating the 
answer set is termed the pooling method, which is an efficient means for constructing 
a large-scale test collection. Strictly speaking, scores of some evaluation metrics 
obtained from this procedure are only approximations because the entire set is not 
examined. However, a comparison of search effectiveness between the IR systems 
or models within the test collection is sufficiently feasible. 
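
The core of the pooling method can be sketched in a few lines. This is only an illustration for a single topic; the function name `build_pool`, the run names, and the document ids are hypothetical.

```python
def build_pool(runs, depth):
    """Return the union of the top-`depth` documents of every submitted run.

    `runs` maps a (hypothetical) run name to its ranked list of document ids
    for one topic; only the pooled documents are passed to the assessors.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d4", "d5"],
}
print(sorted(build_pool(runs, depth=2)))
```

Documents outside the pool (here d3 and d5) are never judged, which is why scores obtained in this manner are approximations.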

The organizers of the NTCIR CLIR tasks consisted of IR researchers in Taiwan, 
Japan, and Korea who collaboratively worked in designing the research tasks, creating 
the topics, managing the relevance judgment process, and evaluating the participating 
IR systems. The authors of this paper were members of the organizer group. 

In our experience, it was more difficult to create topics that were effective for
measuring CLIR performance between the CJKE languages than in the case of simple
monolingual IR. The typical procedure for topic creation in the NTCIR CLIR tasks
was as follows: 


1. Topic candidates were created in Taiwan, Japan, and Korea, respectively, and 
were translated into English. 

2. The English candidates were again translated into Chinese, Japanese, and Korean 
as necessary, and the task organizers preliminarily examined whether or not rel- 
evant documents were sufficiently included in the C, J, K, and E document sets. 

3. Final topics were selected based on the preliminary examination result. 
This complicated procedure was adopted so that the same topics could be used on all
document sets in the CJKE languages, which made it easier to compare search
performance between bilingual searches of CJKE (e.g., between C to J and J to C
searches, both of which involve processing of Chinese and Japanese texts). 

Search topics created for NTCIR CLIR tasks can be approximately classified into 
two types: 1) event-based topics and 2) general concept-based topics. Event-based 
topics typically contain one or more proper nouns of a social event, geographic 
location, or person. An example is “Find reports on the G8 Okinawa Summit 2000” 
(ID 005 in NTCIR-5). If a CLIR system cannot find any corresponding translation 
of the proper noun during its process then the search performance is expected to be 
low. This is generally termed an out-of-vocabulary (OOV) problem. 

Meanwhile, it may be relatively easier to find translations of a general concept, but 
CLIR systems often need to disambiguate translation candidates for the concept. For 
example, a correct translation in the context of the search topic often has to be selected 
from many terms listed in a bilingual dictionary (see Sect. 2.2.2). An instance of the 
general concept-based topics is “Find documents describing disasters thought to be 
caused by abnormal weather” (ID 044 in NTCIR-5). Even though “weather” has a 
relatively definite meaning, many translations are actually enumerated in an E to J 
dictionary and selection of final translations substantially affects CLIR performance. 
The task organizers carefully balanced the two topic types to allow researchers
to develop more effective systems. Through the topic creation process,
approximately 50 topics were included in each of the test collections for NTCIR-3
to -6. 

Needless to say, pooling documents is not easy either. If the pool (i.e., a
document set to be checked during a process of relevance judgment) is too large, then
it is impossible for an assessor to maintain consistent judgments for all documents.
To avoid this problem, the document pool size has to be appropriately adjusted
when extracting top-ranked documents from the search results of each participant.
This is the matter of so-called pooling depth (Kuriyama et al. 2002). 

Also, a system for relevance judgments developed by the National Institute of
Informatics (NII), which was named NACSIS at the time, was used to provide
the assessors with a comfortable human-machine interface for the judgment task,
which contributed to enhancing the consistency and reliability of the judgment results.
A Windows-based assessment system created in Taiwan for this purpose is
explained by Chen (2002). 


2.3.4 IR System Evaluation 


During the process of relevance judgment, the assessors evaluated each document 
using a four-point scale: 1) highly relevant, 2) relevant, 3) partially relevant, and 4) 
irrelevant. The IR research field has a long history of studying relevance concepts 
and operational assessment of them. A multi-grade assessment based on the four-point
scale was adopted in ad hoc IR tasks beginning with NTCIR-1 after carefully
examining discussions in the literature of relevance. 

However, evaluation metrics based on a multi-grade assessment such as the 
Normalized Discounted Cumulative Gain (nDCG) were not yet popular at the time of 
NTCIR-1 to -6 and the main indicator for evaluating IR systems was Mean Average 
Precision (MAP).¹ For calculating the average precision, the four-point scale has to 
be reduced to a binary scale. When “highly relevant” and “relevant” were consid- 
ered to be relevant and the others to be irrelevant, it was specifically termed “rigid” 
relevance in the NTCIR CLIR tasks. If “partially relevant” was included in the rele- 
vant category then “relaxed” relevance was used. Therefore, in NTCIR CLIR tasks, 
two MAP scores were typically computed for a single search result based on rigid 
and relaxed relevance, respectively. Sakai (2020) summarizes evaluation metrics and 
methods in the overall NTCIR project. 
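
The reduction of the four-point scale to rigid and relaxed relevance can be illustrated as follows. This is a minimal sketch for a single topic with hypothetical judgments; the function name `average_precision` and the grade labels used as dictionary values are assumptions, and MAP would be the mean of these values over all topics.

```python
GRADES_RIGID = {"highly relevant", "relevant"}
GRADES_RELAXED = GRADES_RIGID | {"partially relevant"}


def average_precision(ranking, judgments, relevant_grades):
    """Average precision of one ranked list after reducing grades to a binary scale."""
    relevant = {doc for doc, grade in judgments.items() if grade in relevant_grades}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


# Hypothetical judgments and search result for a single topic.
judgments = {
    "d1": "highly relevant",
    "d2": "partially relevant",
    "d3": "irrelevant",
    "d4": "relevant",
}
ranking = ["d1", "d2", "d3", "d4"]

print("rigid AP:  ", average_precision(ranking, judgments, GRADES_RIGID))
print("relaxed AP:", average_precision(ranking, judgments, GRADES_RELAXED))
```

The same search result thus receives two scores, one for each definition of relevance.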


2.4 CLIR Techniques in NTCIR 


This section briefly summarizes typical techniques used in CLIR tasks from NTCIR-3
to -6. For details of the techniques and systems, the overviews of each task
that were published at NTCIR conferences are helpful (Chen et al. 2002; Kishida 
et al. 2004, 2005, 2007). Lists of research groups participating in each task are also 
included in the overviews. 


2.4.1 Monolingual Information Retrieval Techniques 


IR systems of groups participating in NTCIR CLIR tasks typically have two indepen- 
dent components for 1) monolingual IR and 2) translation as explained in Sect. 2.2.1. 
Because computer processing of Chinese, Japanese, and Korean textual data had not 
yet been sufficiently developed at the time, NTCIR CLIR tasks also contributed to 
obtaining useful knowledge regarding CJK text processing for monolingual IR (or 
single language IR: SLIR). The resulting SLIR performance improvement can be 
considered as an achievement in the NTCIR CLIR tasks. 

Particularly, sentences or phrases in the CJK texts have no explicit word boundary, 
which is a characteristic that is different from that of English texts (note that Korean 
texts include white spaces as a delimiter between phrasal units). To construct index 
files in SLIR systems for these languages, either 


1. Word-based indexing, or 
2. Overlapping character bigrams (i.e., n-grams when n = 2) 


were typically used in the NTCIR CLIR tasks. For word-based indexing, some groups 
employed morphological analyzers, whereas index terms were identified from texts 
by simply matching with entries of machine-readable dictionaries in some other 
systems. 

¹ Only in NTCIR-6, nDCG was used for evaluating CLIR performance as a trial. 

Extracting character bigrams from texts is characteristic of the indexing
process for East Asian languages. Assume that a Japanese sentence is “ABCDE”
where A, B, C, D, and E each denote a Japanese character. In the case of
overlapping character bigrams, “AB,” “BC,” “CD,” and “DE” are automatically selected
as index terms. This was known as an effective method for processing texts that were
represented by ideograms. Although character unigrams (i.e., n-grams when n = 1)
are extracted from the target text in the current Internet search engines or some online
public access catalog (OPAC) systems, n = 2 was used in NTCIR CLIR tasks. 
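
The extraction of overlapping character bigrams is simple enough to be shown directly. The following sketch uses a hypothetical function name, `overlapping_bigrams`, and ignores practical details such as punctuation and mixed scripts.

```python
def overlapping_bigrams(text):
    """Return all overlapping character bigrams of a string, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(overlapping_bigrams("ABCDE"))     # the schematic example from the text
print(overlapping_bigrams("情報検索"))  # a Japanese string of four characters
```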

By utilizing an index file constructed according to an indexing method, documents 
have to be ranked by the degree of relevance to each search query. The relevance 
degree is operationally estimated in the system based on a retrieval model. In NTCIR 
CLIR tasks, participant groups typically adopted some standard and well-known 
models such as the vector space model (VSM), Okapi BM25, LM, INQUERY, PIRCS, 
or logistic regression model. In addition, query expansion (QE) by PRF or techniques 
using external resources (e.g., statistical thesauri based on term co-occurrence statis- 
tics or web pages) were incorporated for enhanced search performance. The retrieval 
models and QE techniques were originally developed in the USA or Europe mainly 
for English IR. NTCIR CLIR tasks provided good opportunities for systematically 
confirming their effectiveness for IR of CJK languages. 


2.4.2 Bilingual Information Retrieval (BLIR) Techniques 


Section 2.2 reviewed typical CLIR techniques, which were also utilized in NTCIR 
CLIR tasks. Dictionaries and MT software that were employed by participants in the 
NTCIR-4 CLIR task were extensively enumerated in Kishida et al. (2004). 

Specifically, important problems to be solved for translation in CLIR among CJKE 
were as follows. 


1. Query translation versus document translation: Most participating groups 
adopted a means of translating search topics (queries), whereas some explored 
“pseudo” document translation in which terms in target documents were sim- 
ply replaced with equivalents in another language using a bilingual dictionary 
(i.e., not MT). Additionally, search performance may be improved by combining 
search results from both the query and document translations because it is pos- 
sible that the probability of successful matching of terms between a topic and a 
relevant document increases. This technique was attempted by one group. 

2. Pivot language approach: English was typically used as a pivot language for 
CLIR among CJK, whereas one group attempted bilingual searching via Japanese. 
Selection of the pivot language depends on translation resources such as MT 
software. 

3. OOV problem: As previously mentioned, when a term representing an important 
concept in a search topic is not included in the dictionaries for translation, search 
performance largely degrades. Some search topics in the NTCIR CLIR tasks 
contained names related to current events or affairs and they were not often 
covered if using only a standard bilingual dictionary (see Sect. 2.3.3). For solving 
this problem, some groups attempted to extract translations from web pages for 
the unknown term. 

4. Automatic transliteration: In general, when a word in a foreign language is 
imported, transliteration is often used without semantically representing the word 
in its own language. For example, an English word “hotel” is transliterated into 
three Katakana characters corresponding phonetically to “ho,” “te,” and “ru” in 
Japanese. Although popular Katakana words are listed in standard bilingual
dictionaries, an OOV problem occurs for words that are not listed. In such a case, an English 
word may be automatically converted into a Katakana word (and vice versa) 
via heuristic rules phonetically measuring the similarity between them (Fujii and 
Ishikawa 2001). This type of automatic transliteration was explored in the NTCIR 
CLIR tasks. 

5. Conversion of Kanji character codes: An idea similar to automatic transliteration 
is automatic conversion of Kanji characters between Chinese and Japanese. In the 
NTCIR CLIR tasks, one group attempted to convert traditional Chinese characters 
encoded by the BIG5 character code into Japanese characters represented by 
Extended Unix Code-Japanese (EUC-JP). 

6. Term disambiguation (or WSD): A typical method for term disambiguation was 
to use statistical information of term co-occurrences in the set of target docu- 
ments. In addition, many CLIR systems incorporated a PRF process, which had 
an effect of increasing the rank of documents that included a combination of 
correct translations (see Sect. 2.2.2). Neither method requires any external
resource. In contrast, some external resources such as web pages or parallel
corpora were also applied for term disambiguation by some groups. For example, 
one system attempted to select final query terms based on web pages that were 
extracted from a web category to which the search topic corresponded. 


As a technique for improving CLIR performance, pre-translation PRF was
explored in the NTCIR CLIR tasks. That is, if a corpus in the source language
of an original query is available as an external resource, then a PRF operation on
the external resource may yield a set of more useful query terms; this operation is
termed pre-translation PRF. After obtaining a “richer” representation of the original
query in the source language in this way, standard CLIR is executed based on the
modified query. More commonly, PRF was applied in the form of post-translation PRF,
after the retrieval process proper. A combination of pre- and post-translation PRF was
used by some groups in the NTCIR CLIR tasks. 

In addition, participants in the NTCIR CLIR tasks attempted to address various
other challenges such as document re-ranking, QE via a statistical thesaurus,
translation probability estimation, and the use of an ontology to enhance the
effectiveness of mono- or cross-lingual IR. Although similar research efforts had
already been completed in TREC or CLEF (Cross-Language Evaluation Forum, at that
time), a special aspect of the NTCIR CLIR tasks was the larger differences in language
types between English and CJK. For example, nobody would deny that the “linguistic
distance” between English and Japanese is greater than that between English and
Swedish. The special characteristics of CJK as languages may have contributed to
unique modification or refinement of CLIR techniques (e.g., automatic transliteration). 


Table 2.3 Best MAP scores of SLIR and BLIR in NTCIR-6: Rigid relevance, DESC field*

                                Documents (X)
Search topics                   Chinese          Japanese         Korean
Monolingual (baseline)          0.313 (100%)     0.325 (100%)     0.454 (100%)
BLIR
  Chinese (C to X search)       -                0.312 (95.8%)    N/A
  Japanese (J to X search)      0.078 (24.7%)    -                0.287 (63.2%)
  Korean (K to X search)        0.102 (32.6%)    0.267 (82.1%)    -
  English (E to X search)       0.191 (61.0%)    0.307 (94.4%)    0.292 (64.3%)

* A short sentence describing each search topic was included in a <DESC> element of an XML file
of the topics. The sentence was used as a query for executing searches in this table

It is difficult to concisely present an overview of search performance attained by 
CLIR systems participating in CLIR tasks from NTCIR-3 to -6. Only the best perfor- 
mance of the SLIR and BLIR subtasks in NTCIR-6 is shown in Table 2.3 (Kishida 
et al. 2007), which provides the best MAP scores based on “rigid” relevance by 
each language combination. When comparing the MAP scores between monolin- 
gual searching (SLIR) and BLIR, it appears that BLIR to Japanese documents was 
more successful than to other languages because the percentages were 95.8% for C, 
82.1% for K and 94.4% for E search topics. However, the percentage highly depends 
on the system performance of the research group participating in the task at the time; 
thus, Table 2.3 does not indicate any research finding based on a scientific exami- 
nation. This table is only an example for superficially understanding an aspect of 
NTCIR CLIR tasks. Readers who are interested in the search runs of Table 2.3 can 
refer to Kishida et al. (2007) for more detail. 
2.4.3 Multilingual Information Retrieval (MLIR) Techniques 


Two types of MLIR strategies are most commonly used: 


• (A) All documents and the query are translated into a single language (e.g., 
English), and then monolingual IR is executed, and 

• (B) BLIR is repeated for all pairs of document language and query language, and 
then all search results are finally merged into a single ranked document list. 


Fewer research groups participated in MLIR subtasks compared to those in SLIR 
and BLIR subtasks, and most adopted the type B strategy. In the strategy, an important 
choice is how search results (actually, individual ranked lists by language pairs) are 
merged, which can be considered as a type of data fusion problem. The merging 
operation is also important for applications other than MLIR. 

Typical merging methods in NTCIR CLIR tasks are as follows. 


1. Round-robin merging: Documents are repeatedly selected from the top of each 
ranked list in a sequence. 

2. Raw score merging: All documents are merged and re-ranked according to doc- 
ument scores calculated by an IR model. 

3. Normalized score merging: Document scores that are calculated by an IR model 
are normalized before the documents are merged and re-ranked. 
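
The three merging methods listed above can be sketched as follows. Each ranked list is assumed to be a list of (document id, score) pairs sorted by descending score, with document ids that are unique across languages. Min-max normalization is used for the third method only as one possible choice; the function names and the toy scores are hypothetical.

```python
from itertools import zip_longest


def round_robin_merge(ranked_lists):
    """Take one document from the top of each list in turn."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(pair[0] for pair in tier if pair is not None)
    return merged


def raw_score_merge(ranked_lists):
    """Pool all documents and re-rank them by their original scores."""
    pooled = [pair for ranked in ranked_lists for pair in ranked]
    return [doc for doc, _ in sorted(pooled, key=lambda pair: -pair[1])]


def normalized_score_merge(ranked_lists):
    """Min-max normalize the scores within each list before re-ranking."""
    pooled = []
    for ranked in ranked_lists:
        if not ranked:
            continue
        scores = [score for _, score in ranked]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0  # avoid division by zero for constant scores
        pooled.extend((doc, (score - low) / span) for doc, score in ranked)
    return [doc for doc, _ in sorted(pooled, key=lambda pair: -pair[1])]


# Hypothetical results of two bilingual searches with very different score ranges.
chinese = [("c1", 12.0), ("c2", 9.0), ("c3", 3.0)]
japanese = [("j1", 0.9), ("j2", 0.5)]

print(round_robin_merge([chinese, japanese]))
print(raw_score_merge([chinese, japanese]))
print(normalized_score_merge([chinese, japanese]))
```

In this toy case, raw score merging places all Chinese documents before the Japanese ones only because the two searches produced scores on different scales, which is the situation that normalization is intended to mitigate.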


When applying these methods, there are some difficulties. For example, if the 
number of relevant documents included in the C, J, K, and E components is signifi- 
cantly different, then the difference makes the MLIR more difficult. In this situation, 
an “absolute” relevance probability that is effective over all languages may have to 
be estimated for each document to achieve better performance. Braschler (2004)
discusses the other difficulties of MLIR. In fact, MAP scores of MLIR were typically
lower than those of SLIR and BLIR in the NTCIR CLIR tasks. 


2.5 Concluding Remarks 


Research activity for exploring the cross-lingual ad hoc IR of newspaper articles in 
the NTCIR project ended at the CLIR task in NTCIR-6, for which the conference was 
held in May of 2007. Thereafter, during the 2010s, the performance of Internet search
engines improved remarkably, making it easier to search Chinese, Japanese,
and Korean documents in monolingual IR settings. In addition, several excellent 
tools or resources for language processing have become available. Specifically, new 
technologies such as statistical machine translation or neural machine translation 
have drastically enhanced the effectiveness of MT. 

The current state of monolingual IR and language processing has largely changed 
from the time of the NTCIR CLIR tasks. Experimental findings that were obtained 
from the tasks have contributed to such technological advances and aided researchers 
in developing a more sophisticated CLIR system based on the current technologies 
of monolingual IR and language processing. In addition, the authors believe that 
experience of constructing test collections that consist of comparable corpora in 
NTCIR CLIR tasks is useful for further development of IR theories and techniques 
in multilingual environments. 


Acknowledgements Many researchers in Taiwan, Japan, and Korea worked collaboratively as 
organizers in managing NTCIR CLIR tasks as follows: Hsin-Hsi Chen, Koji Eguchi, Noriko Kando, 
Kazuko Kuriyama, Hyeon Kim, Sukhoon Lee, and Sung Hyon Myaeng (as well as the authors of this 
paper). Additionally, in NTCIR-1 and -2, Toshihiko Nozue, Souichiro Hidaka, Hiroyuki Kato, and 
Masaharu Yoshioka also joined to organize the IR tasks. This paper attempted to summarize some 
aspects of valuable activities and efforts by the organizers and by all participants in the research 
tasks. 


References 


Ballesteros L, Croft WB (1997) Phrasal translation and query expansion techniques for cross- 
language information retrieval. In: Proceedings of the 20th annual international ACM SIGIR 
conference on research and development in information retrieval, pp 84-91 

Braschler M (2004) Combination approaches for multilingual text retrieval. Inf Retr 7(1/2):183-204 

Chen KH (2002) Evaluating Chinese text retrieval with multilingual queries. Knowl Organ 
29(3/4):156-170 

Chen KH, Chen HH, Kando N, Kuriyama K, Lee S, Myaeng SH, Kishida K, Eguchi K, Kim 
H (2002) Overview of CLIR task at the third NTCIR workshop. In: Proceedings of the Third 
NTCIR workshop on research in information retrieval, automatic text summarization and question 
answering 

Fujii A, Ishikawa T (2001) Japanese/English cross-language information retrieval: Exploration of 
query translation and transliteration. Comput Human 35(4):389-420 

Grefenstette G (1998) Cross-language information retrieval. Springer, Berlin 

Kando N, Kuriyama K, Nozue T, Eguchi K, Kato H, Hidaka S (1999) Overview of IR tasks. 
In: Proceedings of the First NTCIR workshop on research in Japanese text retrieval and term 
recognition, pp 11-44 

Kando N, Kuriyama K, Yoshioka M (2001) Overview of Japanese and English information retrieval 
tasks (JEIR) at the second NTCIR workshop. In: Proceedings of the second NTCIR workshop 
on research in Chinese and Japanese text retrieval and text summarization 

Kishida K (2005) Technical issues of cross-language information retrieval: a review. Inf Process 
Manag 41(3):433-455 

Kishida K, Chen KH, Lee S, Kuriyama K, Kando N, Chen HH, Myaeng SH, Eguchi K (2004) 
Overview of CLIR task at the fourth NTCIR workshop. In: Proceedings of the fourth NTCIR 
workshop on research in information access technologies: information retrieval, question answer- 
ing and summarization 

Kishida K, Chen KH, Lee S, Kuriyama K, Kando N, Chen HH, Myaeng SH (2005) Overview of 
CLIR task at the fifth NTCIR workshop. In: Proceedings of the fifth NTCIR workshop meeting 
on evaluation of information access technologies: information retrieval, question answering and 
cross-lingual information access 

Kishida K, Chen KH, Lee S, Kuriyama K, Kando N, Chen HH (2007) Overview of CLIR task at 
the sixth NTCIR workshop. In: Proceedings of the 6th NTCIR workshop meeting on evaluation 
of information access technologies: information retrieval, question answering and cross-lingual 
information access 


Kuriyama K, Kando N, Nozue T, Eguchi K (2002) Pooling for a large-scale test collection: an 
analysis of the search results from the first NTCIR workshop. Inf Retr 5(1):41-59 

Nie JY (2010) Cross-language information retrieval. Morgan & Claypool Publishers 

Oard DW, Diekema AR (1998) Cross-language information retrieval. Ann Rev Inf Sci Technol 
33:223-256 

Peters C, Braschler M, Clough P (eds) (2012) Multilingual information retrieval. Springer, Berlin 

Sakai T (2020) Graded relevance. In: Sakai T, Oard DW, Kando N (eds) Evaluating information 
retrieval and access tasks. Springer, Singapore. The Information Retrieval Series, (in this book) 

Voorhees E, Harman DK (eds) (2005) TREC: experiment and evaluation in information retrieval. 
MIT Press 

Xu J, Weischedel R, Nguyen C (2001) Evaluating a probabilistic model for cross-lingual information 
retrieval. In: Proceedings of the 24th annual international ACM SIGIR conference on research 
and development in information retrieval, pp 105-110


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative 
Commons license, unless indicated otherwise in a credit line to the material. If material is not 
included in the chapter’s Creative Commons license and your intended use is not permitted by 
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from 
the copyright holder. 


Chapter 3
Text Summarization Challenge:
An Evaluation Program for Text
Summarization


Hidetsugu Nanba, Tsutomu Hirao, Takahiro Fukushima, 
and Manabu Okumura 


Abstract In Japan, the Text Summarization Challenge (TSC), the first text sum- 
marization evaluation of its kind, was conducted in 2000-2001 as a part of the 
NTCIR (NII-NACSIS Test Collection for IR Systems) Workshop. The purpose of 
the workshop was to facilitate collecting and sharing text data for summarization by 
researchers in the field and to clarify the issues of evaluation measures for summa- 
rization of Japanese texts. After that, TSC has been held every 18 months as a part 
of the NTCIR project. In this chapter, we describe our TSC series, the data used, and 
the evaluation methods for each task, and the features of TSC evaluation. 


3.1 What is Text Summarization? 


The ever-growing amount of information forces us to read through a great num- 
ber of documents in order to extract relevant information from them. To cope with 
this situation, research on text summarization has attracted much attention recently, 
producing many studies in this field.¹


¹ Many survey papers are now available on text summarization, e.g., Gambhir and Gupta (2017),
Allahyari et al. (2017), Nazari and Mahdavi (2019).
As research on text summarization is a hot topic in Natural Language Processing
(NLP), we also see the need to discuss and clarify issues of how to evaluate text
summarization systems. In Japan, the Text Summarization Challenge (TSC), the first
text summarization evaluation of its kind, was conducted in 2000-2001 as a part of
the NTCIR (NII-NACSIS Test Collection for IR Systems) Workshop. The aim of TSC
was to facilitate collecting and sharing text data for summarization by researchers 
in the field and to clarify the issues of evaluation measures for summarization of 
Japanese texts. 

Since that time, TSC² was held twice more, every 18 months, as a part of the
NTCIR project. Multi-document summarization was included as one of the tasks
for the first time at TSC2 in 2002.

As we mention in Sect. 3.5, the contributions of our TSC can be considered as 
follows: 


• We proposed a new evaluation method, evaluation by revision, that evaluates sum-
maries by measuring the degree of revisions of the system results.

• We proposed a new evaluation method for multiple document summarization
that enables us to measure the effectiveness of redundant sentence reduction in the
systems.


In the following sections, we first introduce the types of summarization and the 
evaluation methods in general. Then, we describe our TSC series, the data used, and 
the evaluation methods for each task. Finally, we summarize the contributions of the 
TSC evaluations. 


3.2 Various Types of Summaries 


Text summarization is a task of producing a shorter text from the source, while keep- 
ing the information content of the source. Summaries are the results of such a task. 
Among the most widely used summaries in the world today are the snippets
that Web search engines display for each Web page. Sparck Jones (1999) discussed
several ways to classify summaries. The following three factors are considered to be 
important for text summarization research: 


Input factors: text length, genre, and single versus multiple documents, 
Purpose factors: who the user is, and the purpose of summarization, 
Output factors: running text or headed text, etc. 


Summaries can be classified with respect to the number of the source texts (single 
document versus multiple document summarization), and with respect to whether 
they are tailored to particular users. Early research in summarization was primarily 
based on single-document summarization, in which systems produced a summary 
from a single source document. Later, however, another task based on multiple
source documents was introduced into text summarization. In multi-document
summarization, several documents sharing a similar topic are taken as the input. The
task of multi-document summarization can be considered more difficult than the
single-document one, because the systems need to remove any redundancies
across multiple documents and then combine the contents from multiple documents into
a coherent summary.

² http://www.lr.pi.titech.ac.jp/tsc/index-en.html

If summaries are targeted at specific users, they are called user-focused, and if they
are intended for users in general, they are called generic. User-focused summaries
are also called query-focused summaries. In query-based summarization, the
summary is generated by selecting sentences that correspond to the user’s query
(Tombros and Sanderson 1998). Sentences that are relevant to the query have a higher
chance of being extracted for the final summary. In terms of summarization purpose,
summaries can be either indicative or informative. Users can make use of indicative 
summaries before referring to the source, e.g., to judge relevance of the source text. 
On the other hand, users may use summaries in place of the source text (informative 
summaries). The snippets of Web search engines are a good example of indicative 
and query-focused summaries. 

As pointed out by Mani and Maybury (1999), summaries can also be classified
into extracts and abstracts, depending on how they are composed. Conventional text 
summarization systems produce summaries by using sentences or paragraphs as a 
basic unit, giving them a degree of importance, sorting them based on the importance, 
and gathering the important sentences. In short, summaries that are constructed of 
a set of important sentences extracted from the source text are called extracts. In 
contrast, summaries that may contain newly produced texts are called abstracts. 
Therefore, abstractive summarization can be much more complex than extractive 
summarization. 
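
The conventional extractive procedure described above (give each sentence a degree of importance, sort by importance, and gather the important sentences) can be sketched as follows. This is only a minimal illustration and not a system used in TSC: the word-frequency scoring, the whitespace tokenizer (which would not suit Japanese text), the `rate` parameter, and the function names `score_sentences` and `extract_summary` are all assumptions made for brevity.

```python
from collections import Counter


def score_sentences(sentences):
    """Assign each sentence an importance score.

    Assumption: importance is the mean corpus frequency of the sentence's words.
    """
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    return [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]


def extract_summary(sentences, rate=0.3):
    """Return an extract containing roughly `rate` of the source sentences."""
    scores = score_sentences(sentences)
    n_keep = max(1, round(len(sentences) * rate))
    # Sort sentence indices by descending importance and keep the top ones.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_keep])  # restore source order for readability
    return [sentences[i] for i in chosen]
```

Because the output consists only of sentences copied from the source, the result is an extract in the sense defined above; producing an abstract would additionally require generating new text.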


3.3 Evaluation Metrics for Text Summarization 


Evaluation methods for text summarization can be largely divided into two categories: 
intrinsic and extrinsic. The quality of summaries can be judged directly based on some 
norms; typically, ideal summaries are produced by hand, or important sentences are 
selected by hand. Then, the quality of summaries is evaluated by comparing them 
with the human-produced summaries (intrinsic evaluation). The quality of a summary 
can also be judged by measuring how it influences the achievement of some other task 
(extrinsic evaluation). Mani and Maybury (1999) stated such tasks can be question- 
answering, reading comprehension, as well as relevance judgement of a document 
to a certain topic indicated by a query. 


Relevance judgement: determines whether it is possible to judge, by reading the
summary, whether the presented document is relevant to a user’s topic, which can be
indicated by her query.

Reading comprehension: determines whether it is possible to correctly complete
a multiple-choice test after reading the summary.


There are two measures for intrinsic evaluation: quality and informativeness
(Gambhir and Gupta 2017). The first measure checks the summary for grammatical 
errors, redundant information, and structural coherence. Here, the linguistic aspects 
of the summary are considered. In the Document Understanding Conference (DUC) 
and Text Analysis Conference (TAC), five questions based on linguistic quality are 
employed for evaluating summaries, which are non-redundancy, focus, grammat- 
icality, referential clarity, and structure and coherence. Human assessors evaluate 
summaries manually by assigning a score to the summary, on a five-point scale. 

For intrinsically evaluating the informativeness of a summary, the most popu-
lar metrics are precision, recall, and F-measure; they measure the overlap between
human-made summaries and automatically generated summaries.


Precision: determines what fraction of the sentences selected by the system are 
correct. 

Recall: determines what proportion of the sentences chosen by humans are selected 
by the system. 

F-measure: is computed by combining recall and precision. 
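
The three measures above can be illustrated with a short sketch for sentence extraction. It assumes that summaries are represented as collections of sentence identifiers and that the F-measure is the usual weighted harmonic mean of precision and recall (balanced when `beta` is 1); the function name and the `beta` parameter are illustrative choices, not part of any TSC scoring tool.

```python
def precision_recall_f(system_sentences, reference_sentences, beta=1.0):
    """Compare system-selected sentences with human-selected sentences."""
    system = set(system_sentences)
    reference = set(reference_sentences)
    correct = len(system & reference)  # sentences chosen by both

    # Precision: fraction of system-selected sentences that are correct.
    precision = correct / len(system) if system else 0.0
    # Recall: fraction of human-selected sentences found by the system.
    recall = correct / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure
```

For example, if a system selects sentences 1 to 4 and the human annotator selected sentences 2, 4, and 6, two sentences are correct, so precision is 2/4, recall is 2/3, and the balanced F-measure is 4/7.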


3.4 Text Summarization Evaluation Campaigns Before 
TSC 


The first conference where text summarization systems were evaluated was held at
the end of the 1990s and was named the TIPSTER Text Summarization Evaluation
(SUMMAC) (Mani and Maybury 1999). At that time, text summaries were evaluated
using two extrinsic methods and one intrinsic method. Two main extrinsic evaluation tasks
were defined: adhoc and categorization. In the adhoc task, the focus was on indicative 
summaries which were tailored to a particular topic, and they were used for relevance 
judgement. In the categorization task, the evaluation sought to find out whether a 
generic summary could effectively present enough information to allow people to 
quickly and correctly categorize a document. The final task, a question-answering 
task, involved an intrinsic evaluation, where a topic-related summary for a document 
was evaluated in terms of its “informativeness”.

Another important conference for text summarization was DUC, which was held 
every year from 2001 to 2007 (Gambhir and Gupta 2017). All editions of this con-
ference used newswire documents. Initially, in DUC-2001 and DUC-2002, the
tasks involved generic summarization of single and multiple documents; they later 
extended to query-based summarization of multiple documents in DUC-2003. In 
DUC-2004, topic-based single and multi-document cross-lingual summaries were 
evaluated. 


3.5 TSC: Our Challenge


Another evaluation program, NTCIR, formed a series of three Text Summariza-
tion Challenge (TSC) workshops: TSC1 in NTCIR-2 from 2000 to 2001, TSC2
in NTCIR-3 from 2001 to 2002, and TSC3 in NTCIR-4 from 2003 to 2004. These
workshops incorporated summarization tasks for Japanese texts. The evaluation was 
done using both extrinsic and intrinsic evaluation methods. 


3.5.1 TSC1 


In TSC1, newspaper articles were used, and two tasks for a single article with intrinsic 
and extrinsic evaluations were performed (Fukushima and Okumura 2001; Nanba and 
Okumura 2002). We used newspaper articles from the Mainichi newspaper database 
of 1994, 1995, and 1998. The first task (Task A) was to produce summaries (extracts 
and free summaries) for intrinsic evaluation. We used recall, precision, and F-measure 
for evaluation of the extracts, and content-based as well as subjective methods for the 
evaluation of free summaries. The second task (Task B) was to produce summaries 
for the information retrieval task. The measures for evaluation were recall, precision, 
and F-measure for the correctness of the task, as well as the time that it takes to 
carry out the task. We also prepared human-produced summaries for the evaluation. 
In terms of genre, we used editorials and business news articles in the TSC1 dry-run 
evaluation, and editorials and articles on social issues in the formal run evaluation. 
As shareable data, we gathered summaries not only for the TSC evaluation but 
also for the researchers to share. By spring 2001, we collected summaries of 180 
newspaper articles. For each article, we had the following seven types of summaries: 
important sentences (10, 30, 50%), summaries created by extracting important parts 
in sentences (20, 40%), and free summaries (20, 40%). 

The basic evaluation design of TSC1 was similar to that of SUMMAC. The dif- 
ferences were as follows: 


• As the intrinsic evaluation in Task A, we used a ranking method in subjective
evaluation for four different summaries (baseline system results, system results,
and two kinds of human summaries).

• Task B was basically the same as one of the SUMMAC extrinsic evaluations (the
adhoc task), except the documents were in Japanese.


The following points were some of the features of TSC1. For Task A, we used several
summarization rates and prepared texts of various lengths and genres for the
evaluations. The text lengths were 600, 900, 1200, and 2400 characters, and the
genres included business news, social issues, and editorials. Also for Task A,
because it was difficult to perform intrinsic evaluation on informative summaries, we
presented the evaluation results as materials for discussion at the NTCIR Workshop 2.


3.5.2 TSC2


TSC2 had two tasks (Okumura et al. 2003): single-document summarization (Task 
A) and multi-document summarization (Task B). In Task A, we asked the participants 
to produce summaries in plain text to be compared with human-prepared summaries 
from single texts. This task was the same as Task A in TSC1. In Task B, multiple
texts were summarized. Given a set of texts that had
been manually gathered for a pre-defined topic, the participants produced summaries
of the set in plain text format. The information that was used to produce the document
set, such as queries and summarization lengths, was also given to the participants.

We used newspaper articles of the Mainichi newspaper database from 1998 and 
1999. As the gold standard (human prepared summaries), we prepared the following 
types of summaries: 


Extract-type summaries: We asked annotators, captioners who were well experi- 
enced in summarization, to select important sentences from each article. 

Abstract-type summaries: We asked the annotators to summarize the original arti- 
cles in two ways. First, to choose important parts of the sentences in extract-type 
summaries. Second, to summarize the original articles freely without worrying 
about sentence boundaries and trying to obtain the main ideas of the articles. 
Both types of abstract-type summaries were used for Task A. Both extract-type 
and abstract-type summaries were made from single articles. 

Summaries from more than one article: Given a set of newspaper articles that has 
been selected based on a certain topic, the annotators produced free summaries 
(short and long summaries) for the set. Topics varied from a kidnapping case to 
the Y2K problem. 


We used summaries prepared by humans for evaluation. The same two intrinsic 
evaluation methods were used for both tasks. They were evaluated by ranking the 
summaries and by measuring the degree of revisions. 


Evaluation by ranking: This is basically the same method as the one we used for
Task A in TSC1 (subjective evaluation). We asked human judges, who are experi-
enced in producing summaries, to evaluate and rank the system summaries from
two points of view:


1. Content: How well does the system summary cover the important content of the
original article?
2. Readability: How readable is the system summary?


Evaluation by revision: This method was newly introduced in TSC2 to
evaluate summaries by measuring the degree of revisions of the system results.
The judges read the original texts and revised the system summaries in terms of
content and readability. The revisions were made using only three editing operations
(insertion, deletion, and replacement). The degree of the human revisions, which
we call “edit distance”, was computed as the number of revised characters
divided by the number of characters in the original summary.

As a baseline for Task A, human-produced summaries, as well as lead-method results, were used.
Also, as a baseline for Task B, human-produced summaries, lead-method results,
and the results based on the Stein method (Stein et al. 1999) were used. The lead
method extracts the first few sentences of news articles. The procedure of the Stein
method is roughly as follows:


1. Produce a summary for each document.
2. Group the summaries into several clusters. The number of clusters is adjusted
to be less than half of the number of the documents.
3. Choose the most representative summary as the summary of the cluster.
4. Compute the similarity among the clusters and output the representative sum-
maries in such order that the similarity of neighboring summaries is high.
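
The four steps listed above can be sketched roughly as follows. This is a hypothetical illustration and not the implementation of Stein et al. (1999) or of the TSC2 baseline: the lead method stands in for the per-document summarizer, word-overlap (Jaccard) similarity on whitespace tokens stands in for the similarity measure, and the greedy merging and ordering strategies are assumptions. Each document is assumed to be a list of sentences.

```python
def lead_summary(sentences, n=2):
    """Lead method: keep the first few sentences of an article."""
    return sentences[:n]


def jaccard(a, b):
    """Word-overlap similarity between two texts (assumed similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def cluster_similarity(c1, c2):
    """Average pairwise similarity between the members of two clusters."""
    return sum(jaccard(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def stein_like_summary(documents, lead_n=2):
    if not documents:
        return []
    # Step 1: one summary per document (lead method as a stand-in).
    summaries = [" ".join(lead_summary(doc, lead_n)) for doc in documents]
    # Step 2: merge the two closest clusters until fewer than half as many
    # clusters as documents remain (but always keep at least one cluster).
    clusters = [[s] for s in summaries]
    while len(clusters) > 1 and len(clusters) >= len(documents) / 2:
        i, j = max(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    # Step 3: the representative is the member most similar to its cluster.
    reps = [max(c, key=lambda s: sum(jaccard(s, t) for t in c)) for c in clusters]
    # Step 4: greedy ordering so that neighboring summaries are similar.
    ordered = [reps.pop(0)]
    while reps:
        nxt = max(reps, key=lambda s: jaccard(ordered[-1], s))
        reps.remove(nxt)
        ordered.append(nxt)
    return ordered
```

With six documents, for instance, the merging loop stops at two clusters, so the output contains two representative summaries.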


We compared the evaluation by revision with the ranking evaluation, which is a 
manual method used in both TSC1 and TSC2. To investigate how well the evaluation 
measure recognizes slight differences in the quality of the summaries, we calculated 
the percentage of cases where the order of edit distance of two summaries matched 
the order of their ranks given by the ranking evaluation by checking the score from 0 
to 1 at 0.1 intervals. As a result, we found that the evaluation by revision is effective 
for recognizing slight differences between computer-produced summaries (Nanba 
and Okumura 2004). 
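
As an illustration of evaluation by revision, the sketch below computes a revision degree for a system summary and then measures how often the order of two summaries by revision degree matches their order by rank. Two points are assumptions rather than the TSC2 procedure: the number of revised characters is approximated by the character-level Levenshtein distance between the system summary and its revised version (in TSC2 it came from the judges' recorded edits), and tied pairs are skipped when computing agreement. The function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Character-level edit distance with insertion, deletion, and replacement."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace (free if characters match)
            ))
        prev = curr
    return prev[-1]


def revision_degree(system_summary, revised_summary):
    """Revised characters divided by the length of the original summary."""
    if not system_summary:
        raise ValueError("system summary must be non-empty")
    return levenshtein(system_summary, revised_summary) / len(system_summary)


def order_agreement(degrees, ranks):
    """Fraction of summary pairs ordered the same way by both evaluations.

    Lower revision degree and lower rank number both mean a better summary.
    """
    agree = total = 0
    for i, j in combinations(range(len(degrees)), 2):
        d = degrees[i] - degrees[j]
        r = ranks[i] - ranks[j]
        if d == 0 or r == 0:
            continue  # assumption: tied pairs are skipped
        total += 1
        agree += (d > 0) == (r > 0)
    return agree / total if total else 0.0
```

A summary that needed one replaced character out of four has a revision degree of 0.25; an agreement of 1.0 means that the two evaluations never disagree on which of two summaries is better.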


3.5.3 TSC3 


In a single document, there are few sentences with the same content. In contrast, 
in multiple documents with multiple sources, there are many sentences that convey 
the same content with different words and phrases, or even with identical sentences. 
Thus, a text summarization system needs to recognize such redundant sentences and 
reduce this redundancy in the output summary. 

However, there was no way of measuring the effectiveness of such methods of
reducing redundancy with the corpora for DUC and TSC2. The gold standard in TSC2
was given as abstracts (free summaries) with the number of characters less than a 
fixed number. It was therefore difficult to use for repeated or automatic evaluation 
and for the extraction of important sentences. Moreover, in DUC, where most of the 
gold standard was abstracts with the number of words less than a fixed number, the 
situation was the same as in TSC2. At DUC 2002, extracts (important sentences) 
were used, and this allowed us to evaluate sentence extraction. However, it was not 
possible to measure the effectiveness of redundant sentence reduction because the 
corpus was not annotated to show sentences with the same content. 

Because many of the current summarization systems for multiple documents were
based on sentence extraction, in TSC3, we assumed that the process of multiple
document summarization consists of the following three steps. We produced
a corpus for evaluating the system at each of these three steps³ (Hirao et al. 2004).


Step 1 Extract important sentences from a given set of documents, 

Step 2 Minimize redundant sentences from the results of Step 1, 

Step 3 Rewrite the results of Step 2 to reduce the size of the summary to the 
specified number of characters or less. 
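
Steps 2 and 3 above can be sketched as follows, taking the important sentences from Step 1 as given. This is a simplified illustration under stated assumptions, not a TSC3 participant system: redundancy is detected with word-overlap similarity on whitespace tokens and a hypothetical `threshold`, and Step 3 is replaced by simply keeping whole sentences within the character budget instead of rewriting them.

```python
def word_overlap(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def remove_redundant(sentences, threshold=0.6):
    """Step 2: keep a sentence only if it is not too similar to one already kept."""
    kept = []
    for s in sentences:
        if all(word_overlap(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


def fit_to_length(sentences, max_chars):
    """Step 3 stand-in: add whole sentences while the character budget allows."""
    out, used = [], 0
    for s in sentences:
        if used + len(s) > max_chars:
            break
        out.append(s)
        used += len(s)
    return out


def summarize_document_set(important_sentences, max_chars, threshold=0.6):
    """Chain Steps 2 and 3 on sentences assumed to come from Step 1."""
    return fit_to_length(remove_redundant(important_sentences, threshold), max_chars)
```

Keeping the steps as separate functions mirrors the corpus design described here, in which the output of each step can be evaluated on its own.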


We have annotated not only the important sentences in the document set, but also 
those among them that have the same content. These are the corpora for Steps 1 
and 2. We have prepared human-produced free summaries (abstracts) for Step 3. 
We constructed extracts and abstracts of thirty sets of documents drawn from the
Mainichi and Yomiuri newspapers published between 1998 and 1999, each of which
was related to a certain topic.

In TSC3, because we had the gold standard (a set of correct important sentences) 
for Steps 1 and 2, we conducted automatic evaluation using a scoring program. We 
adopted intrinsic evaluation by human judges for Step 3. Overall, we used the
following intrinsic and extrinsic evaluations. The intrinsic metrics were “Precision”,
“Coverage”, and “Weighted Coverage.” The extrinsic metric was “Pseudo Question- 
Answering,” i.e., whether a summary has an “answer” to the question or not. The 
evaluation was inspired by the question-answering task in SUMMAC. Please refer 
to Hirao et al. (2004) for more details of the intrinsic metrics. 


3.6 Text Summarization Evaluation Campaigns After TSC 


In DUC-2005 and DUC-2006, multi-document query-based summaries were eval- 
uated whereas in DUC-2007, multi-document update query-based summaries were 
evaluated. These conferences also provided standard corpora of documents and gold 
summaries. 

After 2007, DUC was succeeded by TAC, in which summarization tracks were 
presented (Gambhir and Gupta 2017). The 2008 summarization track consisted of 
two tasks: update task and opinion pilot. The update summarization task aimed to 
produce a short summary (around 100 words) from a collection of news articles, 
assuming that the user has already gone through a collection of previous articles. 
The opinion pilot task aimed to produce summaries of opinions from blogs. The 
2009 summarization track had two tasks: update summarization, which was the same 
as in 2008, and Automatically Evaluating Summaries of Peers (AESOP). AESOP 
was a new task that was introduced in 2009; AESOP computes a summary’s score 
with respect to a particular metric that is related to the summary’s content, such as 
overall responsiveness and pyramid scores. The 2010 summarization track had two 
tasks: guided summarization and AESOP. The guided summarization task aimed
to generate a 100-word summary from a collection of 10 news articles pertaining
to a specific topic; each topic belongs to a previously defined category. The 2011
summarization track consisted of three tasks: guided summarization, AESOP, and
multilingual pilot.

³ This is based on a general idea of a summarization system and is not intended to impose any
conditions on a summarization system.


3.7 Future Perspectives 


We described our TSC series, the data used, and the evaluation methods for each task, 
and the features of TSC evaluation. As we mentioned in Sect. 3.5, the contributions 
of our TSC can be considered as follows: 


• We proposed a new evaluation method, evaluation by revision, that evaluates sum-
maries by measuring the degree of revisions of the system results.

• We proposed a new evaluation method for multiple document summarization that
enables us to measure the effectiveness of redundant sentence reduction in the
systems.


More than 15 years have passed since our last evaluation challenge. Today, the 
text summarization field has changed a lot in that a huge amount of summarization 
data is now available in the field and neural models have prevailed and dominated 
the field. While we now have a variety of large summarization datasets such as 
Gigaword Corpus, New York Times Annotated Corpus, CNN/Daily Mail dataset, 
and NEWSROOM dataset (Grusky et al. 2018), it has, contrary to our expectations,
become difficult to compare systems on these datasets because we do not necessarily have
a standard dataset to compare them with. Even for the same dataset, the performance
might change depending on differently sampled test data. Therefore, we can say that 
the current evaluation of summarization systems might not necessarily be reliable. 
In the future, we should construct a good standard dataset, against which we could 
compare summarization systems. For this purpose, it is necessary to investigate the 
properties of a variety of datasets that will enable us to sample test data to create a 
good evaluation dataset. 


Chapter 4
Challenges in Patent Information
Retrieval


Makoto Iwayama, Atsushi Fujii, and Hidetsugu Nanba 


Abstract We organized tasks on patent information retrieval during the decade 
from NTCIR-3 to NTCIR-8. All of the tasks were ones that reflected real needs 
of professional patent searchers and used large numbers of patent documents. This 
chapter describes the designs of the tasks, the details of the test collections, and the 
challenges addressed in the research field of patent information retrieval. 


4.1 Introduction 


A patent is a grant that gives the inventor the exclusive right to exploit an invention
for a limited term in return for disclosing it to the public. The invention is described
in a document called a patent application (also called a patent specification or an 
application document), which is composed of an abstract and sections describing the 
scope of the invention (the claims), the problems to be solved, the embodiments of 
the invention, etc. The patent application is filed with the patent office. The date of 
filing is called the filing date or application date. After the filing, the patent office 
examines the patent application, and if the invention is judged to be novel, in other 
words, one which has no prior art, a patent is granted for it. 

As the economy grows worldwide, the number of patent applications and grants 
has also grown. The World Intellectual Property Organization (WIPO) announced
that the number of patent applications in 2017 had exceeded three million. In Japan,
about three hundred thousand patent applications are filed every year. Since patent
applications are highly technical and tend to be long, the task of search-
ing for patent applications poses many issues in relation to information retrieval;
similar issues arise when searching for technical papers or legal documents.

In this chapter, we introduce the challenges aimed at addressing the issues of 
patent information retrieval.¹ These challenges were formulated as tasks performed
in NTCIR workshops from 2001 to 2010. The NTCIR tasks were designed on the 
basis of actual patent-related work involving a large number of patent applications. 
The remainder of this chapter is organized as follows. Section 4.2 briefly introduces 
the NTCIR tasks. Section 4.3 describes the tasks in detail, including the search topics, 
document collections, submissions, relevance judgements, evaluation measures, and 
participants. Finally, Sect. 4.4 summarizes NTCIR’s contributions to the research
activities on patent information retrieval. 


4.2 Overview of NTCIR Tasks 


4.2.1 Technology Survey 


Managers, researchers, and developers often want to know whether there are existing 
inventions related to the products they are planning to develop. This situation is sim- 
ilar to when researchers survey research papers before embarking on new research. 

To satisfy this information need, they have to conduct a “technology survey” 
that involves searching for relevant patent applications published so far. Here the 
query might not be described in patent-specific terms, because the searcher is not 
always familiar with the procedure for searching for patents. Moreover, the notion 
of relevance is not patent-specific. Patent applications are treated like technological 
articles such as research papers. 


4.2.2 Invalidity Search 


After inventing a new method, device, material, etc., the inventor describes the inven- 
tion in a patent application and sends it to the patent office. The patent office then 
examines the application to see if there is prior art which invalidates the invention 
by searching for patent applications filed before the filing date. This is called an
“invalidity search” or “prior art search”. Invalidity searches are also conducted by
applicants themselves, because they should be confident of their inventions being 
granted patents before they make their applications. 


¹ Readers who are interested in patent machine translation can refer to Chap. 7.

The invalidity search is patent-specific work. A searcher should be able to under-
stand the components of an invention in accordance with the claims described in
the application. Relevance is assessed based on the novelty or the invalidity of the 
invention. The searcher compares each component of the invention with portions of 
each retrieved document to see if they describe the same invention. If there is no prior 
art which can invalidate the novelty of the invention, a patent is granted; otherwise, 
the application is rejected. In most cases of rejection, several instances of prior art 
are cited, each of which corresponds to a component of the described invention. 


4.2.3 Classification 


Classification codes are extensively used when searching to narrow down the relevant 
applications. The patent office assigns each patent application appropriate classifi- 
cation codes before it is published. Human experts have to expend much effort to 
make this assignment, and for this reason, a (semi-)automatic method is desired. 
The most popular classification codes for patents are the International Patent 
Classification (IPC) codes which are used worldwide. The Japan Patent Office (JPO) 
additionally uses and maintains a list of F-terms (File forming terms). F-terms are 
facet-oriented classification codes, and a patent application is classified from a variety 
of facets (viewpoints) such as objective, application, structure, purpose, and means. 
In NTCIR, patent applications are automatically classified with F-terms in accor- 
dance with the behavior of human experts who perform their classification work 
in two steps. The first step is the theme (topic) classification, assigning a patent 
application to technological themes. Each theme corresponds to a group of IPC 
“sub-classes”. The number of themes is about 2,500. The second step is F-term clas- 
sification, i.e., assigning F-terms to an application that has already been assigned 
themes in the first step. Although the total number of F-terms is huge, over 300,000, 
the number of F-terms within each theme is relatively small, about 130 on average. 


4.2.4 Mining 


For a researcher in a field with high industrial relevance, analyzing research papers 
and patents has become an important aspect of assessing the scope of their field. 
The JPO creates patent application technical trend surveys for fields in which the 
development of technologies is expected, or fields to which social attention is being 
paid. However, creating such surveys is costly and quite time-consuming.

In NTCIR, we aimed to construct a technical trend map from research papers and 
patents in a specific field. For the construction of the map, we focused on the elemental 
(underlying) technologies used in a particular field and their effects. Knowledge of 
the history and effects of the elemental technologies used in a particular field is 
important for grasping the outline of technical trends in the field. Therefore, we
designed the task to extract elemental technologies and their effects from research 
papers and patents. 


4.3 Outline of the NTCIR Tasks 


4.3.1 Technology Survey Task: NTCIR-3 


Table 4.1 summarizes the test collections of the NTCIR tasks for patent retrieval. 

A technology survey task was performed only at NTCIR-3 (Iwayama et al. 2003).
Since this task was our first attempt to handle a practical number of patent
applications, we designed the task to be as close as possible to the ad hoc retrieval
tasks in TREC, except that the target documents were patent applications.

To launch the task, we obtained the cooperation of members of the Japan Intel- 
lectual Property Association (JIPA), who are experts in patent searches. Each JIPA 
member belongs to the intellectual property division in the company he or she works 
for. We collaborated with them in designing our tasks, constructing search topics,
collecting relevant documents, evaluating the submitted results, and in many other ways.
Our collaboration continued through to NTCIR-4, and this was a major reason for 
the success of our challenges. 


4.3.1.1 Search Topics 


The technology survey task assumed a situation where a searcher is interested in
a technology, for example, a “blue light-emitting diode”, described in a newspaper
article. The JIPA members constructed 31 search topics from newspaper articles
which, in most cases, were selected from the topics they were working on in their
daily jobs. Each search topic contained a title, headline, and text of the article that
triggered the request for information, a description and a narrative of the topic, a set
of concepts (keywords) related to the topic, and a supplement with more information
about the relevance. All the search topics were translated into English, Korean, and
Chinese (simplified and traditional).


Table 4.1 NTCIR test collections for patent retrieval


Collection            Task               Search topics   Topic languages       Patent applications  Abstracts         Relevance judgments
NTCIR-3               Technology survey  31 (+6)         ja, en, zh-CN, zh-TW  1998-1999 ja         1995-1999 ja, en  Manual
NTCIR-4 (main)        Invalidity search  34 (+7)         ja, en, zh-CN         1993-1997 ja         1993-1997 en      Manual
NTCIR-4 (additional)  Invalidity search  69              ja                    1993-1997 ja         1993-1997 en      Citation
NTCIR-5               Invalidity search  1,189           ja                    1993-2002 ja         1993-2002 en      Citation
NTCIR-6               Invalidity search  1,685           ja                    1993-2002 ja         1993-2002 en      Citation
NTCIR-6 (English)     Invalidity search  2,221 (+1,000)  en                    1993-2000 en         -                 Citation


4.3.1.2 Document Collections 


The main documents for the retrieval were Japanese full texts of (unexamined) patent 
applications published in 1998 and 1999. The number of documents was 697,262. 
We also released abstracts of patent applications published over the 1995-1999 
period, in Japanese and in English. The English abstracts were translations from the 
Japanese ones. The number of documents was 1,706,154 for the Japanese abstracts 
and 1,701,339 for the English abstracts. Some of the Japanese abstracts did not have 
corresponding English abstracts. 


4.3.1.3 Submissions 


Each participant submitted at least one run that used only the newspaper articles 
and supplements on the given search topics. In addition, we recommended that they 
submit ad hoc runs that used the descriptions and the narratives. For each search 
topic, a ranked list of at most 1000 patent applications was submitted in decreasing 
order of relevance score. 


4.3.1.4 Relevance Judgments 


Relevance in the technology survey task is not patent-specific; that is, it is not based
on the novelty of the invention, but rather on the relatedness of the search topic to 
the patent application. 

The relevance was assessed by JIPA members in two steps. In the first step, the 
JIPA member who created the topic collected relevant documents on the topic before 
its release. Here, although the members were allowed to use any search tools, almost
all of them used Boolean ones, despite the fact that the organizers had provided a 
rank-based search system. In the second step, after the participants submitted their 
results for a topic, the JIPA member who created the topic judged the relevance of the 
unseen documents in the pool collected from the top-ranked submitted documents. 
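
The second judgment step relies on pooling. The following is a minimal sketch of that idea, assuming each run is a ranked list of document IDs; the function name, the pool depth, and the toy IDs are hypothetical and are not taken from the NTCIR tooling.

```python
def build_pool(runs, depth, already_judged=()):
    """Union of the top-`depth` documents of every run, minus judged documents."""
    judged = set(already_judged)
    pool = set()
    for ranked_ids in runs:
        # Only the top-ranked documents of each run enter the pool.
        pool.update(ranked_ids[:depth])
    # Documents found in the first (manual) step need no second judgment.
    return pool - judged


# Hypothetical toy data: two submitted runs and one document judged in step one.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"]]
to_judge = build_pool(runs, depth=2, already_judged={"d1"})
```

Only the documents left in `to_judge` would be shown to the assessor in this sketch.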


4.3.1.5 Evaluation


The submitted runs were evaluated by comparing recall/precision trade-off curves 
and values of mean average precision (MAP). 
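
As a reminder of how MAP is computed, the following minimal sketch assumes binary relevance and one ranked list of document IDs per topic. The function names and the toy data are illustrative only and do not reproduce the evaluation scripts used in the task.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-topic average precision over all judged topics."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two topics with made-up document IDs.
runs = {"t1": ["d1", "d2", "d3"], "t2": ["d9", "d4"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d4"}}
map_score = mean_average_precision(runs, qrels)
```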


4.3.1.6 Participants 


Eight groups submitted 36 runs. The top-performing run was from Ricoh (Itoh et al. 
2003). They focused on re-weighting terms based on their statistics in the differ- 
ent collections (patent applications vs. newspaper articles). For example, the query 
term “president” in a newspaper article might not be effective for retrieving relevant 
patent applications. However, the inverse document frequency (IDF)-based
weighting gives this term a large weight, because it occurs rarely in patent applications.
Their approach, called “term distillation”, involved multiplying the weights in the 
query (i.e., newspaper articles) and the target documents (i.e., patent applications) 
to select effective terms from a query newspaper article. 
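
The idea of multiplying a query-side weight by a target-side weight can be illustrated with a small sketch. It rests on assumptions that are not from the source: the query weight is taken to be the term frequency in the article, and the target weight is taken to be the share of patent applications containing the term. The weights actually used by Itoh et al. (2003) may differ, and all names and numbers below are hypothetical.

```python
from collections import Counter


def distill_terms(query_tokens, target_doc_freq, num_target_docs, top_k=3):
    """Rank query terms by (weight in query) x (weight in target collection)."""
    query_tf = Counter(query_tokens)
    scores = {}
    for term, tf in query_tf.items():
        # Assumed target-side weight: share of target documents containing the term.
        target_weight = target_doc_freq.get(term, 0) / num_target_docs
        scores[term] = tf * target_weight
    # Highest product first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


# Hypothetical article tokens and patent-collection document frequencies.
article = ["president", "announced", "blue", "diode", "diode", "semiconductor"]
patent_df = {"president": 2, "announced": 5, "blue": 300,
             "diode": 900, "semiconductor": 1500}
selected = distill_terms(article, patent_df, num_target_docs=10000, top_k=3)
```

Under these assumed weights, a term such as "president" is dropped because it is rare in the target collection, which is the opposite of what plain IDF weighting would do.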


4.3.2 Invalidity Search Task: NTCIR-4, NTCIR-5, and 
NTCIR-6 


Having gained experience in the technology survey task at NTCIR-3, we moved on 
to invalidity search, which is a patent-specific search. Invalidity search tasks were 
performed in NTCIR-4 (Fujii et al. 2004), NTCIR-5 (Fujii et al. 2005), and NTCIR- 
6 (Fujii et al. 2007b). 


4.3.2.1 Search Topics 


Invalidity searches are searches for instances of prior art that could invalidate a
patent application. Here, the patent application itself becomes a search topic. As 
search topics in NTCIR-4, JIPA members selected 34 Japanese patent applications 
that had been rejected by JPO. We called this set “NTCIR-4 main topics”. 
Regarding the NTCIR-4 main topics, relevant documents were thoroughly
collected by the JIPA members (see Sect. 4.3.2.4 for the details of this collection
procedure). However, we found that the number of relevant documents in invalid- 
ity searches was small compared with the existing test collections for information 
retrieval. Consequently, evaluations made on a small number of topics could poten- 
tially be inaccurate. To increase the number of search topics, JIPA members selected 
an additional 69 search topics from other rejected patent applications. Here, relevant 
documents were only the citations reported by JPO. We called this set “NTCIR-4 
additional topics”. Note that the major difference between these two sets relates to 
the completeness of relevant documents. We will discuss the effect of this issue on
the evaluation in Sect. 4.3.2.6. 

The NTCIR-4 main topics set had two more resources with which to make addi- 
tional evaluations. First, each claim was translated into English and simplified Chi- 
nese for evaluating cross-language retrieval. Second, each Japanese claim had anno- 
tations for the components of the invention. Components were, for example, parts of a 
machine or substances of a chemical compound. Some participants used component 
information in the task. 

In both NTCIR-5 and NTCIR-6, the organizers increased the number of search 
topics by following the same method used to create NTCIR-4 additional topics. 
Accordingly, the relevant documents for these topics were only citations. The number 
of search topics was 1,189 in NTCIR-5 and 1,685 in NTCIR-6. We called these topic 
sets “NTCIR-5 main topics” and “NTCIR-6 main topics”. 

NTCIR-5 included a passage retrieval task as a sub-task of the invalidity search 
task. Since patent applications are lengthy, it is useful to point out significant frag- 
ments (“passages”) in a relevant application. In the passage retrieval task, a relevant 
application retrieved from a search topic was given, and the purpose was to identify 
the relevant passages in the relevant application. We used 378 relevant applications 
obtained from 34 search topics of NTCIR-4 main topics plus another 6 that had been 
used in the dry run in NTCIR-4. 

NTCIR-6 involved an invalidity search task on English patent applications (called 
the “English retrieval task”). The design of the task was the same as the Japanese one. 
Each search topic was a patent application published by the United States Patent and 
Trademark Office (USPTO) in 2000 or 2001. We collected 3,221 search topics (1,000 
for the dry run and 2,221 for the formal run) from those satisfying two conditions:
first, at least 20 citations are listed, and second, at least 90% of the citations are 
included in the target document collection. These citations were relevant documents. 
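
The two selection conditions can be written as a simple filter. This is a sketch only: the function name and the toy document IDs are hypothetical, and the thresholds are the ones stated above (at least 20 citations, at least 90% of them in the collection).

```python
def is_eligible_topic(citations, collection_ids, min_citations=20, min_coverage=0.9):
    """Check the two selection conditions for a candidate search topic."""
    cited = set(citations)
    # Condition 1: enough citations are listed.
    if len(cited) < min_citations:
        return False
    # Condition 2: enough of the citations are in the target collection.
    covered = sum(1 for doc_id in cited if doc_id in collection_ids)
    return covered / len(cited) >= min_coverage


# Hypothetical toy collection and candidate topics.
collection = {f"US{i}" for i in range(100)}
topic_a = [f"US{i}" for i in range(20)]                       # 20 citations, all covered
topic_b = [f"US{i}" for i in range(18)] + ["X1", "X2", "X3"]  # 21 citations, 18 covered
topic_c = [f"US{i}" for i in range(10)]                       # too few citations
eligible = [is_eligible_topic(t, collection) for t in (topic_a, topic_b, topic_c)]
```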


4.3.2.2 Document Collections 


In NTCIR-4, the document collection for the target of searching consisted of 5 years’ 
worth of Japanese (unexamined) patent applications published from 1993 to 1997. 
The number of documents totaled 1.7 million. We additionally released English 
abstracts that were translations of the Japanese abstracts in these applications. 

In NTCIR-5 and NTCIR-6, the document collections (both Japanese patent
applications and English abstracts) were enlarged to include those published over the
10-year period from 1993 to 2002. The number of documents in each collection was 3.5
million. 

In the English retrieval task at NTCIR-6, the document collection was patent 
applications published from USPTO over the period from 1993 to 2000. The number 
of documents totaled about 1 million.


4.3.2.3 Submissions


Although the full texts of the patent applications were provided as search topics, each 
participant was requested to submit a result which only used the claims and the filing 
dates in the search topics. Participants could submit additional results, in which they 
could use any information in the search topics. The number of documents retrieved 
for each search topic was 1,000 at maximum, and these documents were submitted 
in decreasing order of relevance score. 

To assess effectiveness across different sets of search topics, each participant was 
requested to submit a set of results from all the Japanese main topics released so far. 
For example, in NTCIR-6, each participant had to submit runs using NTCIR-4 main 
topics and NTCIR-5 main topics in addition to NTCIR-6 main topics. 

In the passage retrieval task in NTCIR-5, each participant was requested to sort all 
passages in each of the given relevant applications according to the degree to which 
a passage provided grounds to judge if the application was relevant. 


4.3.2.4 Relevance Judgments 


In invalidity searches, the most reliable relevant documents are ones cited by the 
patent office when rejecting patent applications. However, we were not confident 
that using only citations would be enough to evaluate the participating systems from 
the standpoint of recall. Therefore, in NTCIR-4, we exhaustively collected relevant 
documents by performing the same two steps that were used in the technology survey 
task of NTCIR-3. 

First, the JIPA members who created the search topics (NTCIR-4 main topics) 
performed manual searches to collect as many relevant documents as possible. Cita- 
tions from the topic applications were included among the relevant documents. We 
allowed the JIPA members to use any system or resource to find relevant documents. 
In this way, we would obtain a relevant document set under the circumstances of 
their daily patent searches. Most members used Boolean searches, which to this day 
remains the most popular method used in invalidity searches. Second, after the par- 
ticipants submitted their runs, the JIPA members judged the relevance for the unseen 
documents in the pool collected from the top-ranked documents in each run. Here, 
one promising result was that the participating systems could find a relatively large 
number of relevant documents which were neither citations nor relevant documents 
found by the JIPA members in the first step. 

In NTCIR-5 and NTCIR-6, we used only citations as relevant documents, mainly 
because we could not cooperate with expert searchers. 

Relevance was automatically graded as relevant, partially relevant, or irrelevant. A 
document that could solely be used to reject an application was regarded as relevant. 
A document that could be used with other documents to reject an application was 
regarded as partially relevant. Other documents were regarded as irrelevant. 

In NTCIR-6, we tried an alternative definition of the relevance grade, one based 
on the observation that if a search topic and its relevant application have the same IPC 
codes, systems could easily retrieve the relevant application by using IPC codes as
filters. We divided the relevant documents into three classes according to the number 
of shared IPC codes between the topic and a relevant document and compared the 
submitted runs on the basis of the classes (Fujii et al. 2007b). 

In the passage retrieval task of NTCIR-5, we reused the search topics of NTCIR-4,
for which all the relevant passages had been collected in NTCIR-4 by JIPA members.
Relevance was graded as follows. If a single passage could be grounds to judge 
the target document as relevant or partially relevant, this passage was judged to be 
relevant. If a group of passages could be grounds, each passage in the group was 
judged as partially relevant. Otherwise, the passage was judged as irrelevant. 


4.3.2.5 Evaluation 


We used MAP as the evaluation measure in all the invalidity search tasks. In the
passage retrieval task, we additionally used the averaged passage rank at which an 
assessor obtains sufficient grounds to judge whether a target document is relevant 
or partially relevant, when the assessor checked the passages in the top-ranked to 
bottom-ranked target documents. 


4.3.2.6 Participants 


Eight groups participated in the invalidity search tasks of NTCIR-4, ten in NTCIR-5, 
and five in NTCIR-6. The passage retrieval task of NTCIR-5 had four groups, while
the English retrieval task of NTCIR-6 had five groups. In this section, we introduce 
only those groups who participated in most of the main tasks, i.e., document-level 
invalidity searches using Japanese topics. 

Hitachi submitted runs to all of the invalidity search tasks (Mase et al. 2004, 2005; 
Mase and Iwayama 2007). From NTCIR-4 to NTCIR-6, they tried various methods, 
for example, using stop words, filtering by IPC codes, term re-weighting, or using the 
claim’s structure. The methods were composed of two-step searches. The first step 
was a recall-oriented search, and the second step was a re-ranking of the documents 
retrieved by the first step to improve precision. 

NTT Data participated in NTCIR-4 (Konishi et al. 2004) and NTCIR-5 (Konishi 
2005). They expanded the query terms with keywords selected from the “detailed 
descriptions of the invention” (“embodiments”) section. First, they decomposed a 
topic claim into components of the invention by using pattern-matching rules. Next, 
they identified descriptions that explain each component by using another set of 
pattern-matching rules. Lastly, they added keywords in the descriptions to the query 
terms. 

The University of Tsukuba participated in all of the invalidity search tasks (Fujii 
and Ishikawa 2004, 2005; Fujii 2007). They automatically decomposed a topic claim 
into components and searched the components independently. Then, they integrated 
the results. Query terms were also extracted from related passages automatically 
identified in the topic document. Retrieved documents which did not share the IPC
codes of the query application were filtered out. They observed that the IPC filtering 
was more effective in NTCIR-5 main topics than in NTCIR-4 main topics (Fujii 
and Ishikawa 2005; Fujii 2007). This difference might have been due to the nature 
of the relevance of the two sets. The relevant documents for NTCIR-4 main topics 
were manually collected, while those for NTCIR-5 main topics were only citations. 
If we imagine that searches at a patent office often rely on metadata (IPC codes), we 
could further assume that citations by the patent office might be retrieved by the IPC 
filtering. This hypothesis became a motivation for NTCIR-6 to divide up the relevant 
documents according to the number of shared IPC codes with a search topic (Fujii 
et al. 2007b). 

Ricoh used the IPC codes for both filtering and pseudo-relevance feedback (Itoh 
2004, 2005). In the latter usage, they first retrieved documents and extracted IPC 
codes from the top-ranked documents; then, they filtered out the retrieved documents 
which did not share any of the extracted IPC codes. 
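
The feedback-style use of IPC codes described above can be sketched as follows. The function name, the feedback depth, and the toy documents are hypothetical; the sketch only illustrates the two steps of extracting codes from top-ranked documents and then filtering, not the actual system of Itoh (2004, 2005).

```python
def ipc_feedback_filter(ranked_ids, doc_ipc, feedback_depth=2):
    """Keep retrieved documents sharing an IPC code with the top-ranked ones."""
    feedback_codes = set()
    # Step 1: collect IPC codes from the top-ranked documents.
    for doc_id in ranked_ids[:feedback_depth]:
        feedback_codes.update(doc_ipc.get(doc_id, ()))
    # Step 2: drop documents that share none of the collected codes.
    return [doc_id for doc_id in ranked_ids
            if feedback_codes & set(doc_ipc.get(doc_id, ()))]


# Hypothetical initial ranking and IPC assignments.
ranking = ["p1", "p2", "p3", "p4"]
doc_ipc = {"p1": {"H01L"}, "p2": {"H01L", "C09K"}, "p3": {"G06F"}, "p4": {"C09K"}}
filtered = ipc_feedback_filter(ranking, doc_ipc, feedback_depth=2)
```

The original rank order is preserved; only documents outside the extracted IPC codes are removed.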


4.3.3 Patent Classification Task: NTCIR-5, NTCIR-6 


4.3.3.1 Data Collections 


Patent classification tasks were performed in NTCIR-5 (Iwayama et al. 2005) and 
NTCIR-6 (Fujii et al. 2007b). Table 4.2 summarizes the test collections of the NTCIR 
patent classification tasks. 

The training documents in NTCIR-5 and NTCIR-6 consisted of Japanese
(unexamined) patent applications published during 1993-1997 and their English abstracts.
Themes and F-terms for these documents were also released. As for test documents 
in NTCIR-5, 2,008 documents were released for the theme classification task, while 
500 were released for the F-term classification task. The documents were selected 
from Japanese (unexamined) patent applications published in 1998 and 1999. Five 
themes were selected in the F-term classification task. 

In NTCIR-6, only the F-term classification task was performed. We increased the 
number of themes to 108, and the test documents to 21,606. 


Table 4.2 NTCIR test collections for patent classification


                                         NTCIR-5                    NTCIR-6
Task                                     Theme   F-term             F-term
Test documents                           2,008   2,562 (5 themes)   21,606 (108 themes)
Training documents  Patent applications  1993-2002 ja
                    Abstracts            1993-2002 en


4.3.3.2 Submissions


Each participant in NTCIR-5 submitted a ranked list of themes (at maximum 100) 
for each test document in the theme classification task and a ranked list of F-terms (at 
maximum 200) for each test document in the F-term classification task. Note that the 
participants were given the themes of each test document in the F-term classification 
task. 


4.3.3.3 Evaluation 


MAP and F-measure were used in the evaluation. To calculate the F-measure, par- 
ticipants were requested to submit a confident set of themes or F-terms for each test 
document. 


4.3.3.4 Participants 


Four groups submitted results to the theme classification task in NTCIR-5. Theme
classification is similar to classifying patent applications into IPC sub-classes; k- 
Nearest Neighbor (k-NN) and naive Bayes classifiers were popular methods, and the 
participants used these methods in the task (Kim et al. 2005; Tashiro et al. 2005). 

Three groups participated in the F-term classification task in NTCIR-5 and six in 
NTCIR-6. Some groups used support vector machine (SVM) (Tashiro et al. 2005; 
Li et al. 2007) in addition to k-NN (Murata et al. 2005) and naive Bayes (Fujino and 
Isozaki 2007) classifiers. The results suggested that feature selection had a greater 
influence on classification effectiveness than the choice of classifier. Since patent 
applications have several components including abstract, claim, technological field, 
purpose, embodiments, etc., we have many options for which components should be 
used as the source of features. 


4.3.4 Patent Mining Task: NTCIR-7, NTCIR-8 


Patent mining tasks were performed in NTCIR-7 (Nanba et al. 2008) and NTCIR-8 
(Nanba et al. 2010). Table 4.3 summarizes the test collections of the patent mining 
tasks. 

The purpose of the patent mining task was to create technical trend maps from a 
set of research papers and patents. Table 4.4 shows an example of a technical trend 
map. In this map, research papers and patents are classified in terms of elemental 
technologies and their effects. 


Table 4.3 NTCIR test collections for patent mining


                                                  NTCIR-7         NTCIR-8
Task                                              Classification  Classification  Map creation
Test documents                                    879             549             200
Training documents  Patent applications           1993-2002 ja, en
                    Abstracts of research papers  1988-1999 ja, en


Table 4.4 Example of a technical trend map created from a set of research papers and patents 


Effect 1 Effect 2 Effect 3 
Technology 1 [AA 1993] [BB 2002] 
[US Pat. XX/XXX] 
Technology 2 [CC 2000] 
Technology 3 [US Pat. YY/YYYY] | [US Pat. ZZ/ZZZZ] 
[JP Pat. WW/WWWW]


Two steps were used to create a technical trend map: 


(Step 1) For a given field, collect research papers and patents written in various 
languages. 

(Step 2) Extract elemental technologies and their effects from the documents col- 
lected in Step 1 and classify the documents in terms of the elemental technologies 
and their effects. Examples of elemental technologies and their effects are shown
in Sect. 4.3.4.2.


Two subtasks were conducted, one for each step:


• Classify research paper abstracts.
• Create a technical trend map.


We describe the details of these subtasks below. 


4.3.4.1 Research Paper Classification Subtask 


The goal of this subtask was to classify research paper abstracts in accordance with 
the IPC system, which is a standard hierarchical patent classification system used 
around the world. One or more IPC codes are manually assigned to each patent, 
aiming for effective patent retrieval. 


Task 

This task involved assigning one or more IPC codes at the subclass, main group, 
and subgroup levels to a given topic, expressed in terms of the title and abstract of a 
research paper. 
The following tasks were conducted.


• Japanese: classification of Japanese research papers using patent data written in
  Japanese.

• English: classification of English research papers using patent data written in
  English.


Data Collection 

We created English and Japanese topics (titles and abstracts) and their correct classi- 
fications (IPC codes extracted from patents). On average, 1.6, 1.9, and 2.4 IPC codes 
were assigned at the subclass, main group, and subgroup levels, respectively, to each 
topic. In NTCIR-7, we randomly assigned 97 topics to the dry run and the remaining 
879 topics to the formal run. In NTCIR-8, we assigned 95 topics to the dry run and 
the remaining 549 topics to the formal run. The dry run data were provided to the 
participating teams as training data for the formal run. Patents with IPC codes were 
also provided as additional training data. 


Submission 
Participating teams were asked to submit one or more runs, each of which contained 
ranked lists of IPC codes for each topic. 


Evaluation 
MAP, recall, and precision were used in the evaluation. 


Participants 

In NTCIR-7, we had 24 participating systems for the Japanese subtask, 20 for the 
English subtask, and five for the cross-lingual subtask. As far as the number of groups 
is concerned, we had 12 participating groups from universities and companies. In 
NTCIR-8, there were 71 participating systems for the Japanese subtask, 24 for the 
English subtask, and nine for the cross-lingual subtask. There were six participating 
groups. 

Most participating teams employed the k-Nearest Neighbor (k-NN) method, 
which is a comparatively easy way of dealing with a large number of categories, 
because the classification is based only on extracting similar examples, with no 
training process being required. Furthermore, the k-NN method itself produces a
ranking, which enables it to be applied directly to the IPC code ranking. In NTCIR-7,
Xiao et al. (2008) used the k-NN framework and examined various similarity
calculation methods and re-ranking methods.
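
To illustrate why k-NN yields a ranking of IPC codes directly, the following sketch ranks codes by the summed similarity of the nearest labeled patents. The bag-of-words cosine similarity, the vote weighting, and the toy data are assumptions made for illustration and do not describe any particular participating system.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_ipc_codes(topic_text, labeled_patents, k=2):
    """Rank IPC codes by summed similarity of the k nearest labeled patents."""
    query = Counter(topic_text.lower().split())
    scored = [(cosine(query, Counter(text.lower().split())), codes)
              for text, codes in labeled_patents]
    # No training step: the "model" is just the k most similar examples.
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for similarity, codes in neighbors:
        for code in codes:
            votes[code] += similarity
    return sorted(votes, key=lambda code: (-votes[code], code))


# Hypothetical labeled patents (text, IPC codes) and a research-paper topic.
patents = [
    ("light emitting diode semiconductor layer", ["H01L"]),
    ("diode laser light source", ["H01S", "H01L"]),
    ("text retrieval index query", ["G06F"]),
]
ranked_codes = rank_ipc_codes("blue light emitting diode", patents, k=2)
```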


4.3.4.2 Technical Trend Map Creation Subtask 


Task 

This task was conducted in NTCIR-8. The goal of this subtask was to extract expres- 
sions of elemental technologies and their effects from research papers and patents. 
We defined the tag set for this subtask as follows: 

• TECHNOLOGY including algorithms, tools, materials, and data used in each
  study or invention.

• EFFECT including pairs of ATTRIBUTE and VALUE tags.

• ATTRIBUTE and VALUE including effects of a technology that can be expressed
  by a pair comprising an attribute and a value.


For example, suppose that the sentence “Through closed-loop feedback control,
the system could minimize the power loss.” is given to a system. In this
case, the system was expected to output the following tagged sentence: “Through
<TECHNOLOGY>closed-loop feedback control</TECHNOLOGY>, the system
could <EFFECT><VALUE>minimize</VALUE> the <ATTRIBUTE>power
loss</ATTRIBUTE></EFFECT>.”
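
Once a sentence is tagged in this way, the technologies and attribute-value pairs needed for a trend map can be read off mechanically. The sketch below uses regular expressions on the example sentence; the helper names are hypothetical, and pairing attributes with values by their order inside each EFFECT span is an assumption made for illustration.

```python
import re

# The tagged example sentence from the text, written without stray spaces.
TAGGED = ("Through <TECHNOLOGY>closed-loop feedback control</TECHNOLOGY>, the system "
          "could <EFFECT><VALUE>minimize</VALUE> the <ATTRIBUTE>power "
          "loss</ATTRIBUTE></EFFECT>.")


def extract(tag, text):
    """Return the contents of every <tag>...</tag> span in text."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)


def parse_annotations(text):
    """Collect technologies and (attribute, value) pairs from a tagged sentence."""
    effects = []
    for effect in extract("EFFECT", text):
        attributes = extract("ATTRIBUTE", effect)
        values = extract("VALUE", effect)
        # Assumed pairing: i-th attribute goes with i-th value in the same EFFECT.
        effects.extend(zip(attributes, values))
    return {"technologies": extract("TECHNOLOGY", text), "effects": effects}


parsed = parse_annotations(TAGGED)
```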

The following tasks were conducted: 


• Japanese: extraction of technologies and their effects from research papers and
  patents written in Japanese.

• English: extraction of technologies and their effects from research papers and
  patents written in English.


Data Collection 

Sets of topics with manually assigned TECHNOLOGY, EFFECT, ATTRIBUTE, 
and VALUE tags were used for the training and evaluation. Here, we asked a human 
subject to assign these tags to the following four text types: 


• Japanese research papers (500 abstracts)
• Japanese patents (500 abstracts)
• English research papers (500 abstracts)
• English patents (500 abstracts)


Then, for each text type, we randomly selected 50 texts for the dry run and 200
texts for the formal run. We provided the remaining 250 texts to the participating 
teams as training data. 


Submission 
The teams were asked to submit texts with automatically annotated tags. 


Evaluation 
Recall, precision, and F-measure were used in the evaluation. 
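
A minimal sketch of these three measures is given below. It assumes that each annotation is represented as a (tag, text) pair and that a predicted annotation counts as correct only on an exact match with the gold standard; the matching criterion used in the actual evaluation may differ, and the toy annotations are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F-measure over exact-match annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure as the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and system annotations as (tag, text) pairs.
gold = {("TECHNOLOGY", "closed-loop feedback control"),
        ("VALUE", "minimize"),
        ("ATTRIBUTE", "power loss")}
system = {("TECHNOLOGY", "closed-loop feedback control"),
          ("VALUE", "minimize"),
          ("ATTRIBUTE", "loss")}
p, r, f1 = precision_recall_f1(system, gold)
```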


Participants 
In NTCIR-8, there were 27 participating systems for the Japanese subtask and 13 
for the English subtask. There were nine participating teams from universities and
companies. For example, Nishiyama et al. (2010) used a system that applied a 
domain-adaptation method on both research papers and patents and confirmed its 
effectiveness. 


4.4 Contributions


This section chronologically summarizes NTCIR’s contributions to research activi- 
ties on patent information retrieval. Figure 4.1 shows an overview. 


4.4.1 Preliminary Workshop 


In 2000, the Workshop on Patent Retrieval was co-located with the ACM SIGIR Con- 
ference on Research and Development in Information Retrieval (Kando and Leong 
2000). This was the first opportunity for researchers and practitioners associated with 
patent retrieval to exchange knowledge and experience. The outcome of this work- 
shop motivated researchers to foster research and development in patent retrieval by 
developing large test collections. 


4.4.2 Technology Survey 


Following the workshop in 2000, NTCIR-3 was organized as the first evaluation 
workshop focusing on patent information retrieval (2001-2002). The task was a 
technology survey. 

Since patent offices make patent applications public, information retrieval and
natural language processing researchers can use them as a resource. The test
collection constructed for NTCIR-3 was unique in that it contained not only patent 
applications but also search topics and their relevant documents; these were created
and assessed by human experts. It was the first test collection for patent information
retrieval with a large number of documents.


[Figure: timeline from 2000 to 2013. Events: SIGIR Workshop, ACL Workshop, IP&M special issue, Springer book (1st), Springer book (2nd). Rows: technology survey (NTCIR-3, TREC-CHEM); invalidity search (NTCIR-4, NTCIR-5, NTCIR-6, CLEF-IP, TREC-CHEM); passage retrieval (NTCIR-5, CLEF-IP); classification (WIPO-alpha dataset, NTCIR-5, NTCIR-6, CLEF-IP); mining (NTCIR-7, NTCIR-8); image retrieval, image classification, chemical structure recognition, and flowchart recognition (CLEF-IP).]

Fig. 4.1 History of research activities on patent information processing

Here we should note that other workshops included technology survey tasks 
similar to the one performed in NTCIR-3, including the TREC-CHEM tracks in 
2009 (Lupu et al. 2009), 2010 (Lupu et al. 2010) and 2011 (Lupu et al. 2011a).
These tasks focused on research and development in the chemical domain, in which
patent information plays an important role.


4.4.3 Collaboration with Patent Experts 


The organizers of NTCIR-3 and NTCIR-4 collaborated with patent experts, who 
were JIPA members, in constructing the test collections. The JIPA members cre- 
ated the search topics and they also collected and assessed relevant documents. The 
organizers and the JIPA members met once a month to discuss the task design. 
The participants and the JIPA members also shared knowledge and experiences at 
round-table meetings and tutorials. These activities helped to build bridges between 
information retrieval researchers and patent searchers. 


4.4.4 Invalidity Search 


NTCIR-4 (2003-2004) was the first workshop to include an invalidity search. Inva- 
lidity search is truly patent-specific work, and the organizers carefully designed the 
task with JIPA members. 

In NTCIR-4, we examined the issue of whether it was possible to use only cita- 
tions as relevant documents when evaluating submitted runs. While the NTCIR-4 
collection included an exhaustive collection of relevant documents, the NTCIR-5 
and NTCIR-6 collections had only citations. Moreover, since the topics and the 
documents were the same in the NTCIR-4, NTCIR-5, and NTCIR-6 collections,
researchers can compare their retrieval methods under the different ways of identi- 
fying relevant documents. 

Invalidity search tasks were continuously organized in CLEF-IP in 2009 (Roda 
et al. 2009), 2010 (Piroi and Tait 2010) and 2011 (Piroi et al. 2011), and TREC- 
CHEM in 2009 (Lupu et al. 2009), 2010 (Lupu et al. 2010) and 2011 (Lupu et al. 
2011a), under the name “prior art search task”. 

The passage retrieval task in NTCIR-5 (2004-2005) was the first attempt to eval- 
uate handling of passages in invalidity search. A passage retrieval task was revisited 
in CLEF-IP in 2012 (Piroi et al. 2012) and 2013 (Piroi et al. 2013) in a more chal- 
lenging setting. In NTCIR, a relevant document to a search topic was given and the 
purpose was to find relevant passages in the given relevant document. On the other 
hand, in CLEF-IP, relevant passages were directly retrieved based on the claims in 
the search topic. 


4.4.5 Patent Classification


The WIPO-alpha collection, released in 2002, was the first test collection for patent 
classification. It consisted of 75,250 English patent documents labeled with IPC 
codes. Many research papers on patent classification have used WIPO-alpha (Fall 
et al. 2003). 

The NTCIR collections released by the classification tasks (2004-2007) were 
not for IPC, but for the classification codes used in the JPO, i.e., F-terms. F-terms 
re-classify a specific technical field of IPC from a variety of viewpoints, such as 
purpose, means, function, and effect. 

The CLEF-IP classification tasks in 2010 (Piroi and Tait 2010) and 2011 (Piroi 
et al. 2011) released test collections on IPC codes; these were larger than the WIPO- 
alpha collection. The released documents totaled 2.6 million in 2010 and 3.5 million 
in 2011. 


4.4.6 Mining 


Patent mining tasks were performed in NTCIR-7 and NTCIR-8, and similar tasks 
were conducted in subsequent research (Gupta and Manning 2011; Tateisi et al. 
2016). Gupta and Manning (2011) proposed a method to assign FOCUS (an article’s 
main contribution), DOMAIN (an article’s application domain), and TECHNIQUE 
(a method or a tool used in an article) tags to abstracts in the ACL Anthology* for 
the purpose of identifying technical trends. Tateisi et al. (2016) constructed a corpus 
for analyzing the semantic structures of research articles in the computer science 
domain. 

Since February 2019, JDream III has provided a new service for retrieving 
research papers using IPC codes. This service assigns IPC codes at the main group 
level to each research paper by using Nanba’s method (Nanba 2008), which is based 
on the k-NN method. 
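
As a rough illustration of how a k-NN classifier can assign a classification code to a document, the following sketch labels a new text with the majority code among its k most similar labeled documents. It is not Nanba's method itself: the bag-of-words cosine similarity, the toy training texts, and the function names are all assumptions made for illustration, and the two IPC main groups serve only as example labels.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Lower-case whitespace tokenization into a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_assign(query, labeled_docs, k=3):
    """Return the majority label among the k labeled documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(labeled_docs, key=lambda d: cosine(q, bag_of_words(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (text, main-group code) pairs invented for this sketch.
TRAINING = [
    ("query index retrieval of documents from a database", "G06F 16/00"),
    ("search engine ranking of retrieved documents", "G06F 16/00"),
    ("database index structure for text retrieval", "G06F 16/00"),
    ("pyridine compound synthesis and heterocyclic ring", "C07D 213/00"),
    ("heterocyclic compound containing a pyridine ring", "C07D 213/00"),
]

print(knn_assign("ranking documents for text retrieval", TRAINING))
```

A production system would presumably use a far larger set of labeled patents and a stronger text representation; the sketch only shows the voting step that defines k-NN.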


4.4.7 Workshops and Publications 


The organizers of the NTCIR tasks organized the ACL Workshop on Patent Corpus 
Processing in 2003 and edited a special issue on patent processing in Information 
Processing & Management in 2007 (Fujii et al. 2007a). 

The Information Retrieval Facility (IRF), which was a not-for-profit research 
institution based in Vienna, Austria, organized a series of symposia between 2007 
and 2011 to explore reasons for the knowledge gap between information retrieval 
researchers and patent search specialists. The symposia were followed by publication 
of two editions of a book in 2011 (Lupu et al. 2011b) and 2017 (Lupu et al. 2017) 
introducing studies by information retrieval researchers and patent experts. 


* https://www.aclweb.org/anthology/. 

These activities contributed to the research trends in the communities of informa- 
tion retrieval and natural language processing. 


4.4.8 CLEF-IP and TREC-CHEM 


The NTCIR project ended with NTCIR-8 (2009-2010), and it left behind several 
unaddressed issues. Firstly, although NTCIR-3 and NTCIR-4 released multi-lingual 
search topics in English, Korean and Chinese, as well as English abstracts spanning 
ten years, and NTCIR-6 included an English retrieval task using patent 
applications published by USPTO, the workshops focused on Japanese and the 
multi-lingual resources were not widely used. This meant there were no serious evaluations 
of multi-lingual or cross-lingual patent retrieval. Secondly, the tasks ignored images, 
formulas, and chemical structures, despite the fact that these are important pieces of 
information for judging relevance in some domains. 

The above issues that the NTCIR project did not address were investigated in 
CLEF-IP (2009-2013) and TREC-CHEM (2009-2011). Both were annual evalua- 
tion workshops (campaigns) on patent information retrieval. CLEF-IP had tasks for 
prior art search, passage retrieval, and patent classification. The tasks were simi- 
lar to the NTCIR tasks, but most resources were from the European Patent Office, 
covering English, French, and German; hence, the CLEF-IP tasks were inherently 
multi/cross-lingual. In addition, CLEF-IP performed completely new tasks, includ- 
ing ones on image-based retrieval (Piroi et al. 2011), image classification (Piroi et al. 
2011), flowchart/structure recognition (Piroi et al. 2012, 2013), and chemical struc- 
ture recognition (Piroi et al. 2012). TREC-CHEM also had tasks for prior art search 
and technology survey. The TREC-CHEM tasks were challenging and focused on 
the chemical domain, which has many formulae and images. The image-to-structure 
task (Lupu et al. 2011a) in TREC-CHEM was the first one to include chemical 
structure recognition. TREC-CHEM used resources from USPTO. 


Chapter 5 
Multi-modal Summarization 


Tsuneaki Kato 


Abstract Multi-modal summarization is a technology that provides users with 
abridgments of topics of interest. Such abridgments consist of organized text and 
informative graphics. These summarizations have two roles. One is to assist the 
users to review and understand their topics of interest. The other is to guide users both 
visually and verbally in their exploratory search. To establish this technology, it was 
necessary to integrate several research streams. These included information access, 
information extraction, and information visualization; all of these technologies had 
been developing rapidly since the beginning of the twenty-first century. MuST was a 
workshop, the main theme of which was research on multi-modal summarization of 
trend information. It was not an evaluation workshop and did not present the partic- 
ipants with a specific task, because at the time when the workshop was conducted, 
multi-modal summarization was merely an agglomeration of yet-to-be-developed 
technologies that had not yet been fully synthesized. Rather than sharing a task, 
the MuST workshop shared a data set. With a shared annotated corpus as its 
unifying force, the workshop encouraged cooperative and competitive research on 
trend information. Several innovations emerged from the workshop. These covered 
trend information extraction, visualization as information access interface and as 
data analysis method, linguistic summary generation from charts, and trend mining. 


5.1 Background 


By the beginning of the twenty-first century, information access technologies had 
changed and diversified. What was being accessed had changed from entire docu- 
ments to passages within documents, and thence to the information itself. Question 
answering, the motto of which was to return information itself rather than pages or 
documents, had already progressed to managing simple factoid questions, and was 
expected to reply to increasingly complicated queries, such as those asking about 
causes and definitions. 

Access methods had also changed. Exploratory and interactive search was being 
emphasized. Information gathering was no longer a one-shot interaction through 
which users described their interest precisely and in return obtained adequate rele- 
vant feedback; instead, the process had become continuous, wherein users browsed 
information that was gathered according to general descriptions and then identified 
aspects regarding which they needed more detailed information. Through this process, 
users interactively accumulated information while simultaneously expanding their 
area of exploration. 

Methods for displaying the information so obtained had also advanced from sim- 
ple ranked lists to information visualization. Some visualization techniques helped 
users to represent their information requests visually, while others helped them to 
interactively analyze and interpret the results. Such information visualization techniques 
for information access were new and had different characteristics from those for 
scientific visualization. 

Information was no longer simply collected or retrieved. Advances now allowed 
it to be compiled and synthesized using information extraction and multi-document 
summarization, which were techniques that had matured during that period. 

Some of the research fields, such as exploratory search and information visual- 
ization, that adopted such changes in that era closely interacted with each other. This 
was, however, not the case for many other fields. Although one could find a limited 
implementation of some aspects (Ahmad et al. 2004), at that time, it was not envi- 
sioned that anything similar to the recent disaster informatics system would arise; 
this system synthetically processes both numeric data and linguistic data, such as 
documents, and summarizes and visualizes that data according to the users’ require- 
ments. There was, however, an expectation that interactions among, and fusions of, 
those research fields would bring about a number of fundamental innovations. 


5.2 Applications Envisioned 


These anticipated fusions could take many forms. One form could lead to a sophisti- 
cated question-answering system for responding to queries such as “How have oil and 
gasoline prices changed this year?” or “How bad were the typhoons last year?” The 
system would achieve this by compiling text and statistical data and then generating 
combinations of succinct text and information graphics. More advanced applica- 
tions of such systems may include patent or research-map generation, which would 
show and explain the trends of patent applications or the publication of scholarly 
papers. These potential developments were subsequently pursued in another NTCIR 
workshop, which is briefly mentioned in Sect. 5.4. 

This mechanism, which we termed multi-modal summarization, can be regarded 
as an effort to expand text summarization. While text summarization extracts 
important content from a body of real-world text and presents it in a condensed form, 
multi-modal summarization also processes non-linguistic information such as numerical 
data and information graphics. Multimedia presentation generation (e.g., Fasciano 
and Lapalme 1996; Roth and Mattis 1990), which had been actively studied at the 
end of the last century, aimed to generate multimedia presentations from 
media-independent semantic representations. By contrast, multi-modal summarization 
does not presume the existence of such well-formed semantic representations and 
grapples with the enormous amount of unstructured and uncoordinated information 
available in the real world. 

Another form of fusion supports interactive and exploratory search. It interprets 
and guides users’ queries linguistically and visually, progressing from the abstract to 
the concrete and thence to the specific. For example, initially, one may be interested 
in the annual movement of the oil price but later become interested in the change 
at a specific point in time, and finally, decide to investigate the cause and effect of 
that change. It also supports users’ analysis of a series of events by showing various 
data from several viewpoints. For example, the occurrences of typhoons can be plotted 
on geographic and time scales and then linked to data on the resultant damage and its 
associated verbal descriptions. At least two characteristics are required for such 
systems to be effective. Firstly, a framework is needed that seamlessly supports users 
throughout the information access process, from browsing an outline or summary, 
through subsequent elaboration and specification, to acquiring accurate information. 
Secondly, linguistic and non-linguistic information should be cooperatively employed 
in this process. 
Information need not be limited to text but may include non-linguistic information 
such as a series of numerical values. Non-linguistic modes could be utilized even 
during presentation, which would then lead onward to multi-modal presentation and 
information visualization. 

The term multi-modal summarization is also used for this second technology, 
although the name does not adequately emphasize the significance of interactivity and 
the relationship to exploratory search. The two technologies share the name because 
they have a common core that compiles useful and relevant information and 
presents it to users utilizing multiple modes, including text and visuals. 


5.3 Multi-modal Summarization on Trend Information 


MuST was a workshop on multi-modal summarization focused on trend information 
(Kato et al. 2005, 2007a, b, 2008). Why did we focus on trend information? 
It was because a trend, which is a general tendency in the way a situation is changing 
or developing,¹ is based on temporal statistical data and can be obtained by 
synthetically summarizing that data, but not by simple enumeration. Trends are the first answers 
to users’ questions such as “How has the game machine industry performed since 
2006?”, “How have oil and gasoline prices changed this year?”, and “How bad were 
the typhoons last year?” Each answer to those questions can be considered a summary 
of all the information that users are interested in and a starting point for interactive 
and explorative information access. 

The information from which trends are composed and the process of identifying 
trends have several interesting features. Firstly, to obtain trends, it is necessary to 
compile information spanning a specific and extensive period. As the collected 
information includes significant redundancies, such compilations must be synthetic 
and well organized. Secondly, trends usually contain summaries of non-linguistic 
information, for example, statistical information such as time-series data and 
geometric data. Some statistics, such as political party approval ratings and 
companies’ market shares of a given product type, are more complicated and have 
additional dimensions. Each dimension could serve as an axis for representing those 
statistics and may call for a different summarization method. Thirdly, not only 
reports on changes in statistical data, but also their interpretation, analysis of 
causes, and forecasts of impacts are important and should be included when 
defining trends. 

As trend compilation requires sophisticated processes for handling complex and 
diverse information, it is an important research subject for multi-modal summariza- 
tion aimed at supporting interactive and explorative information access. 


5.3.1 Objective 


The objective of the MuST workshop was to create an agora or arena where 
researchers from the several fields mentioned above could interact. The workshop 
prioritized trend analysis as its common theme because trends have interesting 
characteristics that make them suitable as the starting point for exploratory search 
and as a subject for analysis. The MuST workshop promoted both cooperative and 
competitive research on trend information. It was not an evaluation workshop and thus 
identified neither a specific task nor evaluation measures.² Many workshops are 
motivated by a common evaluation. Sometimes the objective of a workshop is 
to enable large-scale evaluation, which requires the pooling method. It is 
beneficial to evaluate technologies on common ground using standard measures. 
That, however, is only possible when technologies have matured or when they are 
focused on common objectives. Research on multi-modal summarization consisted of 
many kernels of technologies that were still in development and had not yet been 
synthesized. Accordingly, each research group had its specific focus. In that 
situation, neither a common evaluation nor shared tasks were possible or stimulating. 
That is why we did not conduct an evaluation-oriented workshop. We needed another 
motivation to make the workshop cooperative and competitive, while still allowing 
the participants to focus on their interests. 

The MuST workshop was conducted a bit earlier than the IEEE VAST shared-task 
evaluation (IEEE symposium on VAST 2006). Although both were concerned 
with visualization technology, they were different in nature. MuST addressed various 
problems, rather than a single substantial problem such as the one that IEEE VAST 
undertook. The policy of MuST was instead similar to that of the interactive track held 
in TREC 6 (Dumais and Belkin 2005), in which, through a common experiment, the 
participants conducted their own studies; such individual studies are more productive 
than a joint evaluation. During the MuST workshop, many technologies reflecting 
each participant’s interests were examined. Although they would be associated with 
each other later in the process, initially, they did not have the same goal. 


² In its third cycle, however, some evaluation tasks were set. Those tasks were considered as shared 
building blocks common to trend information summarization. 


5.3.2 Data Set as a Unifying Force 


Instead of a common topic for evaluation, a data set provided a unifying force for 
the MuST workshop. The use of a shared resource, which motivated researchers to 
participate and to conduct several research missions, was the major characteristic 
of the workshop. The resources that were shared, the MuST data set, included the 
materials to be processed, the intermediate results acting as the organizational hub, 
and the eventual output design. 

The core of the data set is annotated newspaper articles concerning statistics 
and a wide variety of topics. The topics were drawn from disparate social and 
economic domains, such as the oil industry, the personal-computer market, and car 
production; groups of events such as earthquakes and typhoons; and organizations 
such as Sony Corp. Linguistic descriptions of statistics and reports on events in 
articles were identified and annotated, as trends would be extracted from them. For 
example, trends in the personal-computer industry included statistics on shipment 
volume, shipment value, and market share of major manufacturers. Typhoon trends 
consisted of a review of typhoon-related events, such as their formation, landfalls, 
and related damage statistics. 

Examples of English texts to which the annotation schema was applied are shown 
in Fig. 5.1, instead of the real data, which is in Japanese. Sentences mentioning 
selected statistics or events are annotated as unit elements. From the text of a 
unit element, phrases mentioning the name of the statistic (name element), the 
value of the statistic (val element), the relative values, which are associated with 
the statistic but are not the value itself (rel element), dates (date element), and 
other parameters (par element) are identified and annotated. 


<unit stat="nationwide average of pump price of gasoline"><del 
type="src">Based on the July 6th report announced by the Oil Information Center,</del> 
<name part="head">the price of gasoline (one liter, regular)</name>, based on the 
research conducted <date gra="week" abs="19990617">this week</date>, reached 
a <name part="foot">national average</name> of <val>92 yen</val>, <rel 
type="diff">1 yen</rel> higher than <date gra="week" abs="19990610">last 
week</date>’s <name part="head">average price</name></unit>. 


<unit stat="Dubai oil price"><name>The oil price (Dubai Oil)</name> 
has kept dropping since its <rel type="ord">peak</rel> <date gra="month" 
abs="199710">last October</date>, of <val>around $20</val> <name 
part="foot">per barrel</name>, and fell to <val>$12.50</val> in 
<date gra="ten-days" abs="19980121">late January</date></unit>. 
<unit stat="Dubai oil price">After <date gra="ten-days" 
abs="19980121">that</date>, ø<ins type="name">Dubai oil price</ins> 
rose temporarily, <del type="rsn">because of tension due to the Iraq situation,</del> 
but has been struggling recently at <val>around $10</val><del type="other">, due to 
oversupply</del></unit>. 


<unit event="typhoon landfall">Medium-strength <par>typhoon No. 
10</par> struck <par>Makurazaki-shi, Kagoshima</par>, at <date gra="hour" 
abs="19980917">about 4:30 pm on the 17th</date>, and will strike in <par>the 
vicinity of Shukumo-shi, Kochi</par> <date gra="hour" abs="19980917">the same 
night</date></unit>. 


<unit stat="domestic shipment volume"><name>The domestic shipment 
volume</name> for <date gra="half-year" abs="199804">the first half of the 
year</date> was <val>4,391,000</val>, which is <rel type="prop">34%</rel> 
higher than for <date gra="half-year" abs="199704">the same period last 
year</date> and marked <rel type="ord">the highest level</rel> for <name 
part="foot">the half-year range</name></unit>. 


Fig. 5.1 Examples of MuST data annotations on English text 


The annotation of the MuST data set represents the intermediate result of semantic 
and pragmatic analysis tuned to statistical and/or event information. In the 
summarization, extraction and analysis of important sentences are followed by rephrasing 
and sentence construction to eliminate redundancy and maintain consistency. 
Annotation corresponds to the output of extraction and analysis and the input to rephrasing 
and sentence generation. Using the terminology of the information extraction field, 
this annotation completes named-entity recognition and temporal-expression 
analysis. For researchers who are interested in sentence extraction or text processing 
on named-entity recognition and temporal-expression analysis, annotation can be 
referred to as the gold standard of their process. It can also be used as training data 
if they take a machine learning approach. For researchers interested in rephrasing, 
sentence generation, and information visualization, annotation can be used as input 
data in which several fundamental analyses are already completed. In extreme cases, 
studies on information visualization from the text could be conducted without text 
processing. In this sense, the annotated articles behave as a hub for multi-modal 
summarization. 
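
To make the hub role of the annotated articles more concrete, the following sketch reads one unit element in the style of Fig. 5.1 and groups its annotated phrases by element type, using Python's standard xml.etree.ElementTree module. The element and attribute names follow Fig. 5.1; the shortened example string (the del element is omitted), the function name, and the output layout are assumptions made for illustration, not part of the MuST data set distribution.

```python
import xml.etree.ElementTree as ET

# Shortened version of the first passage in Fig. 5.1 (assumption: del element omitted).
UNIT = (
    '<unit stat="nationwide average of pump price of gasoline">'
    '<name part="head">the price of gasoline (one liter, regular)</name>, based on the '
    'research conducted <date gra="week" abs="19990617">this week</date>, reached a '
    '<name part="foot">national average</name> of <val>92 yen</val>, '
    '<rel type="diff">1 yen</rel> higher than '
    '<date gra="week" abs="19990610">last week</date>’s '
    '<name part="head">average price</name></unit>'
)

TAGS = ("name", "val", "rel", "date", "par")


def read_unit(xml_text):
    """Group the annotated phrases of one unit element by element type."""
    unit = ET.fromstring(xml_text)
    phrases = {tag: [] for tag in TAGS}
    for child in unit:
        if child.tag in phrases:
            # Keep the phrase text together with its attributes (part, gra, abs, type).
            phrases[child.tag].append({"text": child.text, **child.attrib})
    return {"stat": unit.get("stat"), "event": unit.get("event"), "phrases": phrases}


record = read_unit(UNIT)
print(record["stat"])
print([v["text"] for v in record["phrases"]["val"]])
print([(d["abs"], d["text"]) for d in record["phrases"]["date"]])
```

A researcher working on visualization or sentence generation could start from such records directly, without running any text analysis of their own.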

Multi-modal summarization requires several component technologies that are 
dispersed across many research fields. This makes it difficult to construct an inte- 
grated system. By using this data set, nevertheless, the participants can address their 
own subjects of interest. This is especially important for those studying elemental 
technologies. Moreover, participants from different communities can discuss their 
interests with each other using the data set as common ground and can contemplate 
how their studies or their modules fit into the framework. Of course, researchers 
having the same interest can use the data set as material for objective evaluation. 
Encouraging and fostering research through such interchanges was the objective of 
sharing this research resource and of the MuST workshop. 


5.3.3 Outcome 


Many research themes were pursued in the MuST workshop and several technologies 
emerged from it. These include extraction of statistics from texts as materials for trend 
summarization; visualization of statistical information extracted and/or collected; 
generation of text that explains statistical information; and trend mining, a 
version of text mining that attempts to find and visualize trends in huge document 
sets. 


5.3.3.1 Trend Information Extraction 


Extracting information on statistics from text was a major sub-problem of trend 
summarization. Many participants addressed this problem, which is why 
this theme was pursued in the evaluation-workshop style in the final cycle of the 
MuST workshop. 

The simplest form of information extraction is to obtain as many tuples as possible 
of three elements: the name of a statistic, a date, and the value of the statistic on 
that date, for example, (Dubai oil price, 1998/01/21, $12.50). 
Such triplets constitute the points plotted on a chart depicting the changes or trends of 
a given statistical category. Many complicated problems would remain even if the 
date and numeral expressions could be extracted using techniques of named-entity 
recognition. Those difficulties are epitomized in the first passage shown in Fig. 5.1, 
“the price of gasoline (one liter, regular), ..., reached a national average of 92 yen, 
1 yen higher than last week’s average price.” 
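
The triplet idea can be sketched on annotated text as follows. The heuristic used here, pairing each val element with the closest preceding date element in the same unit, is an assumption chosen for brevity and is not a method proposed in the workshop; it already fails when the date follows the value, as in “fell to $12.50 in late January” in Fig. 5.1, which hints at why the problem is harder than it looks.

```python
import xml.etree.ElementTree as ET


def extract_triplets(unit_xml):
    """Pair each val element with the closest preceding date element in the same unit."""
    unit = ET.fromstring(unit_xml)
    stat = unit.get("stat")
    last_date = None
    triplets = []
    for child in unit:
        if child.tag == "date":
            last_date = child.get("abs")
        elif child.tag == "val" and last_date is not None:
            triplets.append((stat, last_date, child.text))
    return triplets


# Shortened version of the first passage in Fig. 5.1 (assumption: del element omitted).
GASOLINE = (
    '<unit stat="nationwide average of pump price of gasoline">'
    '<name part="head">the price of gasoline (one liter, regular)</name>, based on the '
    'research conducted <date gra="week" abs="19990617">this week</date>, reached a '
    '<name part="foot">national average</name> of <val>92 yen</val>, '
    '<rel type="diff">1 yen</rel> higher than '
    '<date gra="week" abs="19990610">last week</date>’s '
    '<name part="head">average price</name></unit>'
)

print(extract_triplets(GASOLINE))
```

Note that the rel element (“1 yen”) is deliberately ignored here, because it is not a value of the statistic itself.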

First, the names of statistics are long and complex; they are frequently abbreviated 
and may be expressed in more than one way. They are usually expressed as a single noun 
phrase, but are sometimes split across several phrases. That is the case in this example, 
in which the name of the statistic discussed is a national average of pump price of 
gasoline (one liter, regular). A method to handle such complex names of statistics was 
proposed. It decomposes statistic names into their components and categorizes the 
characteristics and functions of those components. To identify the name in its entirety, the method first 
identifies each component by text-chunking and then assembles those components 
into one name (Mori et al. 2008). 

Second, not all numerical expressions directly describe the statistical values. Some 
of them are comparative or relative expressions. In this example, “1 yen” is not a 
gasoline price itself but the difference between two prices. Such relative expressions must 
be distinguished from direct expressions of statistical values. On the other hand, 
such comparisons make it possible to obtain an additional triplet instance consisting of 
the gasoline price, “last week,” and “91 yen.” Methods were proposed for distinguishing 
those expressions and using them to obtain additional triplet instances. 
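
The arithmetic behind the additional triplet is simple and can be sketched as below. The function names and the 'higher'/'lower' direction flag are assumptions made for illustration; the sketch covers only difference-type expressions (rel type="diff" in Fig. 5.1) and not proportions such as “34% higher.”

```python
import re


def amount(text):
    """Extract the first number from a phrase such as '92 yen'."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no number in {text!r}")
    return float(match.group())


def derive_reference_value(current, diff, direction):
    """Derive the value at the reference date from a current value and a difference.

    direction is 'higher' if the current value exceeds the reference value,
    and 'lower' if it falls below it.
    """
    if direction == "higher":
        return amount(current) - amount(diff)
    if direction == "lower":
        return amount(current) + amount(diff)
    raise ValueError(f"unknown direction: {direction}")


# "92 yen, 1 yen higher than last week's average price" gives last week's value.
derived = (
    "nationwide average of pump price of gasoline",
    "19990610",
    derive_reference_value("92 yen", "1 yen", "higher"),
)
print(derived)
```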

In addition, relative or context-dependent time expressions such as “last week” and 
cases where more than one statistic is mentioned in a sentence raised problems that 
are still to be solved. 

Other research paid attention to the extraction of information beyond simple triplets. 
Qualitative expressions, such as “peak” and “has kept dropping” in the second passage 
in Fig. 5.1, were used cooperatively with numerical data representations for trend 
summarization. Descriptions of the causes of events, such as “because of tension 
due to the Iraq situation,” are useful for understanding context. Techniques were 
proposed for extracting and using such descriptions for summarization and 
visualization purposes. 


5.3.3.2 Visualization 


Interactivity was identified as a major objective for visualization in the MuST 
workshop, because it enables interactive and exploratory search. Techniques were 
proposed that would assist users in analyzing trends from various viewpoints and 
provide response mechanisms for new requests that emerged from such analysis. 

Figure 5.2a shows an example of visualization as an information access interface 
(Matsushita et al. 2004), in which a line chart serves as the interface. 
The chart as a whole represents the changes of a statistic of interest. The data points 
and segments are connected to the article that describes those statistics. Users can 
easily go back and forth between the chart and the articles as they are interconnected. 
This is a technique known as brushing (Scherr 2008). In another visualization shown 
in Fig. 5.2b, the line chart is augmented by schematic shapes that represent qualitative 
changes extracted from articles, such as “rebounding” and “continuing to increase” 
(Matsushita and Kato 2006). This chart can also be interconnected with textual 
materials. This is a typical example of multi-modal summarization. 

For data analysis using visualization, a framework named a visualization cube was 
proposed (Takama and Yamada 2009, 2010). Events such as earthquakes, which are 
characterized by time and geographical location, have their features represented as 
a cube, which allows the systematic manipulation of visual representation according 
to changes in the user’s viewpoint. That is, a user can, through intuitive operations, 
freely place earthquakes of interest on a topographical map or on a timeline. Figure 5.3 
schematically shows this operation. Statistics can be handled similarly. Each statistic 
corresponds to one cube and the cubes can be stacked upon each other. This operation 
corresponds to drawing a stacked bar chart. Changing the granularity of the chart or 
focusing on a specific data range are also defined as operations of particular cubes. 
In this sense, the visualization cube is a visual counterpart of the OLAP cube 
(Codd et al. 1993) used in online analytical processing. 
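
The flavor of these cube operations can be conveyed with a toy sketch. This is not the visualization cube framework itself: the record layout, the function names, and all data values below are made up for illustration. Projecting events onto one axis corresponds to placing them on a timeline or on a map, and stacking corresponds to combining several statistics period by period, as in a stacked bar chart.

```python
from collections import defaultdict

# Hypothetical event records: (year, region, magnitude). All values are invented.
EVENTS = [
    (2001, "Region A", 6.1),
    (2002, "Region B", 5.4),
    (2002, "Region A", 6.8),
    (2003, "Region C", 5.9),
]


def project(events, axis):
    """Group events along one axis of the cube: 'time' or 'place'."""
    index = {"time": 0, "place": 1}[axis]
    groups = defaultdict(list)
    for event in events:
        groups[event[index]].append(event)
    return dict(groups)


def stack(*series):
    """Stack several statistics, each given as a {period: value} dict, period by period."""
    periods = sorted(set().union(*series))
    return {p: [s.get(p, 0) for s in series] for p in periods}


print(project(EVENTS, "time"))
print(project(EVENTS, "place"))
print(stack({2001: 10, 2002: 12}, {2002: 3, 2003: 4}))
```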


[Figure 5.2: (a) a line chart, “Dubai Oil Price,” linked to “Articles on Dubai Oil Price”; (b) a line chart, “Dubai Oil Price”] 

Fig. 5.2 Two examples of visualization for information access interface from Matsushita et al. 
(2004), Matsushita and Kato (2006) 


5.3.3.3 Linguistic Summary Generation from Charts 


Summarization can also be done using linguistic expressions. A typical approach is to 
condense long documents into succinct phrases. In multi-modal summarization, series 
of numbers, tables, and charts can likewise be verbalized. This makes it possible for complex 
numerical dynamics to be expressed in a short descriptive phrase such as “wild 
gyration.” 

Fig. 5.3 Visualization for data analysis from Takama and Yamada (2009) 


A method was proposed for generating paragraph-length documents that explain 
a line chart of a given set of statistics (Kobayashi et al. 2007; Kobayashi and Okumura 
2008). Determining the content of such documents is critical. The chart is segmented, 
and a description of the relevant values and a description of the shape of each segment 
are decided and then appropriately linked to the content. The two types of 
texts, those describing values and those describing shapes, are stored and used in the 
system as linguistic knowledge that is drawn from a corpus of real-life human 
explanations. 
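
The segment-then-describe idea can be sketched in a few lines. This is a deliberately naive stand-in, not the method of Kobayashi et al.: the segmentation rule (maximal runs that keep moving in one direction), the three shape phrases, and the data points are assumptions made for illustration, with the values only loosely modeled on the Dubai oil price passage in Fig. 5.1.

```python
def direction(a, b):
    """Classify the move from a to b as a rise, a fall, or flat."""
    if b > a:
        return "rise"
    if b < a:
        return "fall"
    return "flat"


def segment(values):
    """Split a series of two or more points into maximal runs with one direction."""
    segments = []
    start = 0
    for i in range(1, len(values) - 1):
        if direction(values[i - 1], values[i]) != direction(values[i], values[i + 1]):
            segments.append((start, i))
            start = i
    segments.append((start, len(values) - 1))
    return segments


# Hypothetical shape phrases; a real system would draw these from a corpus.
SHAPE_PHRASES = {"rise": "rose", "fall": "fell", "flat": "stayed level"}


def describe(name, dates, values):
    """Produce one sentence per segment, combining a shape phrase with the end values."""
    sentences = []
    for start, end in segment(values):
        verb = SHAPE_PHRASES[direction(values[start], values[end])]
        sentences.append(
            f"{name} {verb} from {values[start]} in {dates[start]} "
            f"to {values[end]} in {dates[end]}."
        )
    return sentences


# Invented data points for illustration.
DATES = ["1997-10", "1997-12", "1998-01", "1998-02", "1998-03"]
VALUES = [20.0, 16.0, 12.5, 13.0, 10.0]

for sentence in describe("The Dubai oil price", DATES, VALUES):
    print(sentence)
```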


5.3.3.4 Trend Mining 


Some trend summarizations can be conducted with a broader perspective via a 
version of text mining, which we termed trend mining, that reveals current trends. 
Keywords, such as names of statistics, are linked to relevant topics. The observation 
that certain keywords appear frequently in documents reveals a trend that specific 
subjects are topical. Moreover, the co-occurrence pattern of those keywords sug- 
gests their relationship. One proposal visualized the relationship of statistical terms 
by calculating their co-occurrence frequencies. Such patterns are characteristic of 
events and phenomena in the real world (Kawai et al. 2008). The dynamic network 
established in this way allows users to review the structure of complex and global 
problems. By doing so, users can discover factors related to a given problem, 
which facilitates access to accurate information about it. 
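
A minimal sketch of the co-occurrence counting that underlies such a network is given below. The substring matching, the document-level counting unit, and the three example sentences are assumptions made for illustration and do not reproduce the proposal of Kawai et al.; the resulting pair counts could serve as edge weights in a keyword network.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents, keywords):
    """Count, for each keyword pair, the number of documents that contain both."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        present = sorted(k for k in keywords if k in lowered)
        counts.update(combinations(present, 2))
    return counts


# Invented example documents and statistic names.
DOCUMENTS = [
    "Oil price rises push up the gasoline price.",
    "Gasoline price and oil price both fell.",
    "The yen rate moved as the oil price rose.",
]
KEYWORDS = ["oil price", "gasoline price", "yen rate"]

for pair, count in cooccurrence(DOCUMENTS, KEYWORDS).most_common():
    print(pair, count)
```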


5.4 Implication 


The MuST workshop was conducted from 2005 to 2008 at the NTCIR-5, 6, and 7 
workshops. It was a pilot task at first, and then became a core task with an evaluation 
subtask. Research activities on multi-modal summarization and trends went beyond 
these workshops. For five years from 2006, special theme sessions were held at 
annual conferences of the Japan Society for Artificial Intelligence (JSAI). These 
focused on information compilation (Kato and Matsushita 2006), which aimed at 
using multi-modal summarization as an interface for interactive information access. 
It was emphasized that linguistic and non-linguistic information should be managed 
and utilized seamlessly. In 2009, a special interest group of the same name was 
launched by the JSAI. In 2012, it was renamed to Interactive Information Access 
and Visual Mining, and its activities have continued to the present (SIG-AM 2020). 

In the NTCIR workshops, at NTCIR-8, an evaluation task was conducted on 
interactive information access using visual information (Kato et al. 2011). The patent 
information mining task in NTCIR-8 also handled text data and numerical data and 
extracted some trends observed in patent information (Nanba et al. 2010). 

It is doubtful whether the MuST workshop itself had any direct influence on subse- 
quent research trends. The workshop, however, contributed to advancing research on 
information access. Exploratory search has since become a key research area. Visual 
interfaces are an important component of such research. The MuST workshop was 
a significant catalyst in these developments. 


Chapter 6 
Opinion Analysis Corpora Across 
Languages 


Yohei Seki 


Abstract At NTCIR-6, 7, and 8, we included a new multilingual opinion analysis 
task (MOAT) that involved Japanese, English, and Chinese newspapers. This was 
the first task that compared the performance of sentiment retrieval strategies with 
common subtasks across languages. In this paper, we introduce the research question 
posed by NTCIR MOAT and present what has been achieved to date. We then describe 
the types of tasks and research that have involved our test collection both previously 
and in current research. Finally, we summarize our contributions and discuss future 
research directions. 


6.1 Introduction 


Sentiment analysis (sometimes called “opinion mining”) is a research topic that has 
been actively discussed and developed for some 20 years, particularly in the fields 
of natural language processing (NLP) and information retrieval (IR) (Pang and Lee 
2008). In this paper, we introduce the multilingual opinion analysis task (MOAT) 
(Seki et al. 2010, 2008, 2007), which was included in NTCIR-6, 7, and 8 (2006- 
2010). We then discuss the role and novelty of the task in sentiment analysis research. 

Sentiment analysis research began in 2002 (Pang et al. 2002; Turney 2002; Wiebe 
et al. 2002). Various frameworks for classifying documents in terms of positivity 
or negativity that use either supervised learning (Pang et al. 2002) or unsuper- 
vised learning (Turney 2002) have been proposed. In parallel, many researchers 
started to build opinion corpora based on newspaper articles (Wiebe et al. 2002) for 
multi-perspective question answering (MPQA). Other early research work was pub- 
lished at the AAAI 2004 Spring Symposium: Exploring Attitude and Affect in Text: 
Theories and Applications (Shanahan et al. 2006). 
At the Text Retrieval Conference (TREC) in 2006, a new “Blog Track” was
introduced, and it continued until 2010.¹ The original organizers released the
TREC Blogs06 Collection (Macdonald and Ounis 2006), which contains
100,649 blog posts (excluding duplicate documents) and over 3.2 million permalinks.
This dataset was used for the opinion finding (blog post) retrieval task in the TREC
2006 Blog Track and for the polarity opinion finding (blog post) retrieval task in the
TREC 2007 Blog Track. In addition, the MPQA opinion corpus from the University
of Pittsburgh (Wiebe et al. 2005), which defines a framework for opinion annotation
using multiple assessors, has been released.

Building on this previous work, we introduced our opinion analysis task at NTCIR- 
6 in 2006. The novel aspects of the NTCIR MOAT task can be summarized as follows: 


1. We have released an opinion annotation corpus for evaluation workshops. The 
annotation units include opinionatedness, topic relevance, polarity, opinion holder 
(from NTCIR-6), and opinion target (from NTCIR-7). 

2. We have provided a multilingual opinion corpus that includes material in English, 
Chinese, and Japanese. 

3. The topic set in the evaluation corpus is shared across languages. 


In Sect. 6.2, we give details of the NTCIR MOAT design to clarify its novel fea- 
tures and suggest an opinion corpus annotation strategy for evaluation workshops. In 
Sect. 6.3, we explain the evolution of opinion analysis research since the introduction 
of MOAT. Finally, in Sect. 6.4, we conclude our remarks and discuss future research 
directions. 


6.2 NTCIR MOAT 


6.2.1 Overview 


NTCIR MOAT was held at NTCIR-6 (Seki et al. 2007), NTCIR-7 (Seki et al. 2008), 
and NTCIR-8 (Seki et al. 2010). The task definition evolved through the three ses- 
sions, as shown in Table 6.1. 

The goal of the task is to form a bridge between element technologies, such as
opinion/polarity sentence classification or opinion holder/target phrase recognition,
and applications such as (opinion) IR or question answering. The target languages
include English, Chinese (both Traditional and Simplified), and Japanese, and the
topic set for IR or question answering is shared across languages. We prepared
a document set relevant to the topics, retrieved from newspaper articles published
in each target language, and evaluated the systems using these document sets
annotated by multiple assessors.


¹ http://trec.nist.gov/data/blog.html.
Table 6.1 MOAT progress during NTCIR-6, 7, and 8

                      NTCIR-6                     NTCIR-7               NTCIR-8
Target language       English, Japanese,          +Simplified Chinese
                      Traditional Chinese
Subtasks              Opinionated, Relevance,     +Target               +Cross-lingual
                      Polarity, Holder
Annotation unit       Sentence                    Opinion clause
Focused application   Information retrieval       Q&A (ACLIA*)          Opinion Q&A
Target corpora        Mainichi, Yomiuri, CIRB,    +Xinhua Chinese       +NYT, UDN
                      Xinhua English,
                      Hong Kong Standard, etc.
(Period)              1998-2001                                         2002-2005

*http://research.nii.ac.jp/ntcir/permission/ntcir-7/perm-ja-ACLIA.html


6.2.2 Research Questions at NTCIR MOAT 


Many researchers have focused on a resourceless approach to sentiment analysis 
(Elming et al. 2014; Le et al. 2016). Blitzer et al. (2007) proposed a domain adaptation 
approach for sentiment classification. Wan (2009) addressed the Chinese sentiment 
classification problem by using English sentiment corpora on the Internet. This type 
of research can be categorized as a semi-supervised approach to opinion/sentiment 
analysis that aims to solve the resource problem by using small labeled and large 
unlabeled datasets. We recognize that addressing language resource problems in 
sentiment analysis for nonnative languages is an important research area. Alternatively,
applications such as the Europe Media Monitor (EMM) News Explorer²
provide an excellent service by including viewpoints from different countries. We
also understand that providing these varied opinions from different countries offers 
opportunities for better worldwide communications. NTCIR MOAT is the first task 
to provide opportunities for nonnative researchers to develop a sentiment analysis 
system for low-resource languages and to bridge cultures by clarifying opinion dif- 
ferences across different languages. 


² http://emm.newsbrief.eu/NewsBrief/clusteredition/en/latest.html.
6.2.3 Subtasks


With the broad range of information sources available on the web and in social media, 
there has been increased interest by both commercial and governmental parties in 
trying to analyze and monitor the flow of prevailing attitudes from anonymous users 
automatically. As a result, the research community has given much attention to 
automatic identification and processing of the following. 


• Sentences in which an opinion is expressed (Wiebe et al. 2004),

• The polarity of the expression (Wilson et al. 2005),

• The opinion holders of the expression (Choi et al. 2005),

• The opinion targets of the expression (Ruppenhofer et al. 2008), and

• Opinion question answering (Stoyanov et al. 2005; Dang 2008).³


With these factors in mind, we defined the subtasks in NTCIR MOAT as follows. 


1. Opinionated sentences 
The judgment of opinionated sentences is a binary decision for all sentences. 

2. Relevant sentences 
Each set contains documents that are found to be relevant to an opinion question, 
such as that shown in Fig. 6.1. For those participating in the relevance subtask
evaluation, each opinionated sentence should be judged as either relevant (Y) or 
non-relevant (N) to the opinion questions. In NTCIR-8 MOAT, only opinionated 
sentences were annotated for relevance. 

3. Opinion polarities 
The polarity is determined for each opinion clause. In addition, the polarity is to 
be determined with respect to the topic description if the sentence is relevant to 
the topic, and based on the attitude of the opinion if the sentence is not relevant 
to the topic. The possible polarity values are positive (POS), negative (NEG), or 
neutral (NEU). 

4. Opinion holders 
The opinion holders are annotated in terms of opinion clauses that express an 
opinion. However, the opinion holder for an opinion clause can occur anywhere 
in the document. The assessors performed a kind of co-reference resolution:
when the opinion holder of an opinion clause was an anaphoric expression, they
marked the opinion holder by noting the antecedent of the anaphora. Each opinion
clause must have at least one opinion holder.

5. Opinion targets 
The opinion targets were annotated in a similar manner to the opinion holders. 
Each opinion clause must have at least one opinion target. 

6. Cross-lingual opinion Q&A 
The cross-lingual subtask is defined as the opinion Q&A task. Together with 
the questions in English, the answer opinions should be extracted in different 
languages. To keep it simple, the extraction unit is defined as a sentence. The 


³ https://tac.nist.gov/2008/qa/index.html.
<TOPIC>

<NUM>N03</NUM> 

<TITLE>Bali Island Terrorist Bombing</TITLE> 

<QUESTION> What reasons behind the 2002 Bali bombings were discussed?</QUESTION> 
<POLARITY>Neutral</POLARITY> 

<OPTYPE>Reason controversy</OPTYPE> 

<CONC>Bali Island Terrorist Bombing</CONC> 

<PERIOD>2002-1010</PERIOD> 

</TOPIC> 


Fig. 6.1 Example: opinion question fields at NTCIR-8 MOAT 


answer set is defined as the combination of the annotations for the conventional 
subtasks, with opinionatedness, polarity, and answeredness being matched with 
the definition in the question description. 
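
The topic format of Fig. 6.1 and the answer-set definition above can be illustrated with a short Python sketch. The topic text is copied from Fig. 6.1; everything else is an assumption made for illustration: the sentence record layout (`id`, `opinionated`, `answers_question`, `polarity`), the function names, and the mapping from the `<POLARITY>` field to the POS/NEG/NEU labels are hypothetical and are not taken from the MOAT distribution.

```python
import xml.etree.ElementTree as ET

# Topic text copied from Fig. 6.1.
TOPIC_XML = """<TOPIC>
<NUM>N03</NUM>
<TITLE>Bali Island Terrorist Bombing</TITLE>
<QUESTION> What reasons behind the 2002 Bali bombings were discussed?</QUESTION>
<POLARITY>Neutral</POLARITY>
<OPTYPE>Reason controversy</OPTYPE>
<CONC>Bali Island Terrorist Bombing</CONC>
<PERIOD>2002-1010</PERIOD>
</TOPIC>"""


def parse_topic(xml_text):
    """Return the topic fields as a {tag: text} dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def select_answers(sentences, topic):
    """Keep sentences that are opinionated, answer the question, and carry
    the polarity requested by the topic (hypothetical record layout)."""
    # Assumed mapping: "Neutral" -> "NEU", "Positive" -> "POS", "Negative" -> "NEG".
    wanted = topic["POLARITY"].upper()[:3]
    return [
        s["id"]
        for s in sentences
        if s["opinionated"] and s["answers_question"] and s["polarity"] == wanted
    ]


topic = parse_topic(TOPIC_XML)

# Made-up annotations for four sentences, for illustration only.
sentences = [
    {"id": "s1", "opinionated": True, "answers_question": True, "polarity": "NEU"},
    {"id": "s2", "opinionated": True, "answers_question": False, "polarity": "NEU"},
    {"id": "s3", "opinionated": False, "answers_question": True, "polarity": None},
    {"id": "s4", "opinionated": True, "answers_question": True, "polarity": "NEG"},
]

print(topic["NUM"], topic["OPTYPE"])
print(select_answers(sentences, topic))
```

Only sentence `s1` satisfies all three conditions in this toy input; the other three each fail one of them.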


6.2.4 Opinion Corpus Annotation Requirements 


Opinion corpus annotation for multiple domains (as in news topics) usually requires 
expert linguistic knowledge because crowdsourcing annotation (such as the Amazon 
Mechanical Turk) does not fit the NTCIR MOAT annotation framework. We con- 
ducted our evaluation using agreed (intersection) annotations from multiple expert 
assessors. To check the stability of this evaluation strategy, we compared the evaluation
results obtained with agreed (intersection) annotation and with selective (union)
annotation as the gold standard, using NTCIR-8 MOAT submission data.

For the English cases in Table 6.2 (the κ coefficient between assessor annotations
was 0.73) and the Traditional Chinese cases in Table 6.3 (κ coefficient 0.46), the
rank of the participants’ systems differs between the two strategies. Although the
rank differences for the English cases were within statistical significance, among the
Traditional Chinese cases, the precision-oriented systems (CTL and WIA) tended to be
ranked higher for cases of agreed (intersection) annotation, and recall-oriented
systems (KLELAB-1 and NTU) tended to be ranked lower. For the Simplified Chinese
cases in Table 6.4 (κ coefficient 0.97) and the Japanese cases in Table 6.5 (κ coefficient
0.72), there was no rank difference for the participants’ systems despite the different
strategies because of either high κ agreement (Simplified Chinese) or a low number of
participants (Japanese). From these observations, we concluded that the κ coefficient
between assessor annotations should exceed 0.7 for stable evaluation. We also found
that strong opinion definition and online annotation tools were helpful, but using expert
linguistic annotators remained necessary to achieve high κ agreement.
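
The comparison described above can be sketched in Python on toy data. This is a minimal illustration under stated assumptions, not the procedure used at NTCIR-8: the text does not say which κ statistic was used, so Cohen's κ for two assessors is assumed here, and the binary opinionatedness labels and the two system outputs are invented for the example.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two assessors' label lists (assumed kappa variant)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum(
        (a.count(label) / n) * (b.count(label) / n) for label in set(a) | set(b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def gold(a, b, strategy):
    """Build gold labels: 'agreed' = intersection, 'selective' = union."""
    if strategy == "agreed":
        return [int(x == 1 and y == 1) for x, y in zip(a, b)]
    if strategy == "selective":
        return [int(x == 1 or y == 1) for x, y in zip(a, b)]
    raise ValueError(strategy)


def f1(system, reference):
    """F1-score of binary system output against binary reference labels."""
    tp = sum(1 for s, r in zip(system, reference) if s == 1 and r == 1)
    fp = sum(1 for s, r in zip(system, reference) if s == 1 and r == 0)
    fn = sum(1 for s, r in zip(system, reference) if s == 0 and r == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = opinionated, 0 = not opinionated.
assessor_a = [1, 1, 0, 0, 1, 0, 1, 0]
assessor_b = [1, 0, 0, 0, 1, 0, 1, 1]
precise_system = [1, 0, 0, 0, 1, 0, 0, 0]   # few positives, precision-oriented
greedy_system = [1, 1, 1, 0, 1, 0, 1, 1]    # many positives, recall-oriented

agreed = gold(assessor_a, assessor_b, "agreed")
selective = gold(assessor_a, assessor_b, "selective")

print("kappa:", cohen_kappa(assessor_a, assessor_b))
for name, run in [("precise", precise_system), ("greedy", greedy_system)]:
    print(name, round(f1(run, agreed), 3), round(f1(run, selective), 3))
```

On this toy input the precision-oriented run scores higher against the agreed labels and the recall-oriented run scores higher against the selective labels, which mirrors the kind of rank change reported for the Traditional Chinese cases.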
Table 6.2 Evaluation strategy analysis using NTCIR-8 MOAT English raw submission data
English (F1-score) / κ = 0.73

Rank on agreed    Significance    Rank on non-agreed
UNINE-1           A               UNINE-1
NECLC-bsf         A B             NECLC-bs1
NECLC-bs0         A B C           NECLC-bsf
NECLC-bs1         A B C D         NECLC-bs0
UNINE-2           B C D           UNINE-2
KLELAB-3          B C D E         NTU-2
KAISTIRNLP-2      C D E           KLELAB-2
KLELAB-2          C D E           KLELAB-3
KAISTIRNLP-1      C D E           NTU-1
NTU-2             D E F           KLELAB-1
KLELAB-1          D E F           KAISTIRNLP-2
NTU-1             D E F           KAISTIRNLP-1
OPAL-2            E F G           OPAL-1
OPAL-3            F G             OPAL-2
OPAL-1            F G             OPAL-3
PolyU-1           G               SICS-1
SICS-1            G H             PolyU-1
PolyU-2           H               PolyU-2


6.2.5 Cross-Lingual Topic Analysis 


We ranked topics by averaging their F1-scores, the harmonic mean of precision
and recall, obtained from all NTCIR-8 MOAT raw submissions in the opinionated 
judgment subtask. The best three (easy) topics and worst three (difficult) topics and 
the opinion percentage in the source documents are shown in Table 6.6. 
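
As a rough illustration of this ranking procedure, the following Python sketch computes an F1-score per run and topic, averages over runs, and sorts the topics. The run names, topic IDs (`T1` to `T3`), and precision/recall values are invented for the example and do not come from the NTCIR-8 MOAT submissions.

```python
from statistics import mean


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank_topics(runs):
    """Rank topics by F1 averaged over all runs, easiest topic first.

    runs: {run_id: {topic_id: (precision, recall)}}
    """
    per_topic = {}
    for scores in runs.values():
        for topic_id, (precision, recall) in scores.items():
            per_topic.setdefault(topic_id, []).append(f1(precision, recall))
    averaged = {topic_id: mean(values) for topic_id, values in per_topic.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Invented precision/recall pairs for two runs and three topics.
runs = {
    "runA": {"T1": (0.8, 0.6), "T2": (0.4, 0.4), "T3": (0.2, 0.1)},
    "runB": {"T1": (0.6, 0.6), "T2": (0.5, 0.3), "T3": (0.1, 0.2)},
}

ranking = rank_topics(runs)
for topic_id, score in ranking:
    print(topic_id, round(score, 3))
```

The first entries of the ranking correspond to the "easy" topics and the last entries to the "difficult" ones.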

From these results, we found that topic difficulty depends strongly on the
language. We also found that topics tended to be easier when the source documents
contained many opinions. Exceptions to this rule included the opinion question for topic N16:
“What reasons have been given for the anti-Japanese demonstrations that took place 
in April, 2005 in Peking and Shanghai in China?” We surmise that this was caused 
by the systems’ difficulty in judging quite sensitive opinions expressed in newspaper 
articles in each language. 
Table 6.3 Evaluation strategy analysis using NTCIR-8 MOAT traditional Chinese raw submission
data
Traditional Chinese (F1-score) / κ = 0.46

Rank on agreed    Significance    Rank on non-agreed
CityUHK-2         A               CityUHK-1
CTL-1             A               CityUHK-3
CityUHK-1         A               KLELAB-1
CityUHK-3         A               NTU-1
WIA-1             A               NTU-2
WIA-2             A               CityUHK-2
KLELAB-3          B               cyut-1
KLELAB-1          B               KLELAB-3
NTU-2             B               WIA-1
NTU-1             B               WIA-2
cyut-1            B               cyut-2
cyut-2            B               CTL-1
UNINE-1                           UNINE-1
cyut-3                            cyut-3


Table 6.4 Evaluation strategy analysis using NTCIR-8 MOAT simplified Chinese raw submission
data
Simplified Chinese (F1-score) / κ = 0.97

Rank on agreed    Significance    Rank on non-agreed
PKUTM-2           A               PKUTM-2
PKUTM-1           A B             PKUTM-1
BUPT-2            A B             BUPT-2
CTL-1             B               CTL-1
PKUTM-3           B C             PKUTM-3
BUPT-1            B C             BUPT-1
WIA-1             C D             WIA-1
WIA-2             C D             WIA-2
NECLC-bsf         D               NECLC-bsf
NECLC-bs0         D               NECLC-bs0
NECLC-bs1         D               NECLC-bs1
PolyU-1                           PolyU-1
Table 6.5 Evaluation strategy analysis using NTCIR-8 MOAT Japanese raw submission data
Japanese (F1-score) / κ = 0.72

Rank on agreed    Significance    Rank on non-agreed
TUT-1             A               TUT-1
TUT-3             A B             TUT-3
IISR-3            B C             IISR-3
TUT-2             B C             TUT-2
IISR-1            B C             IISR-1
IISR-2            C               IISR-2
UNINE-1           D               UNINE-1

Table 6.6 Cross-lingual topic analysis using NTCIR-8 MOAT raw submission data 


                    English            Traditional Chinese   Simplified Chinese    Japanese
                    Topic  Opinion %   Topic  Opinion %      Topic  Opinion %      Topic  Opinion %
                           in doc set         in doc set            in doc set            in doc set

Easy 34.6 
topics 

35.3 

28.1 

Difficult topics    N18    7.6         N16    19.4           N07    9.5            N24    35.3
                    N13    8.9         N13    15.0           N41    14.9           N18    37.7
                    N06    10.0        N20    18.8           N16    20.6           N32    27.0
Average             Avg.   16.7        Avg.   32.1           Avg.   18.6           Avg.   33.9


6.3 Opinion Analysis Research Since MOAT 


6.3.1 Research Using the NTCIR MOAT Test Collection 


Some researchers have used the NTCIR MOAT test collection and presented their
work at top-rated conferences, particularly those focused on cross-lingual sentiment
analysis. Two representative examples are as follows.


1. Joint Bilingual Sentiment Classification 
Lu et al. (2011) hypothesized that aligned sentences between languages should be
similar in opinion polarity and strength. They proposed a method for improving
the polarity classification performance that used the MPQA opinion corpus and the
NTCIR MOAT corpus as labeled corpora, and aligned news corpora in Chinese
and English as unlabeled corpora. They extended their work by using a cross-lingual
mixture model (Meng et al. 2012) to improve performance when learning
polarity clues from unlabeled corpora.
2. Cross-lingual Sentiment Lexicon Learning 

Gao et al. (2015) proposed a method for generating low-resource language senti- 
ment lexicons using available English sentiment lexicons. They created Chinese 
sentiment lexicons using a bilingual word graph label propagation approach. 
They evaluated Chinese sentiment classification at the sentence level by using 
the NTCIR MOAT corpus and found increased effectiveness of sentiment classi- 
fication when using their generated sentiment lexicon to generate features. 


6.3.2 Opinion Corpus in News 


Several opinion corpora involving news have been developed after NTCIR MOAT 
was published. In this subsection, we introduce the SemEval-2007 Task 14: Affective 
Corpus (Strapparava and Mihalcea 2007) and the sentiment-annotated quotation set 
(Balahur and Steinberger 2009; Balahur et al. 2010). 

In the SemEval-2007 Affective Corpus, six emotion labels and two polarity labels 
have been annotated to headlines collected from 1,250 news websites and newspaper 
articles. The sentiment-annotated quotation set contains a set of 1,590 English lan- 
guage quotations (reported speech), manually annotated by two independent sets of 
annotators for sentiment (positive, negative, or objective/neutral) expressed toward 
the entities mentioned inside the quotation. Web crawling for news articles employed 
the EMM (Steinberger et al. 2009)⁴ developed by the European Commission Joint
Research Centre. 

The NTCIR MOAT corpus, however, remains in use as a large cross-lingual news 
opinion corpus targeted at Chinese, Japanese, and English. 


6.3.3 Current Opinion Analysis Research: The Social Media 
Corpus and Deep NLP 


After NTCIR MOAT was published, Twitter⁵ and other microblog media came into
widespread use, and NLP/IR researchers turned their attention to tweet sentiment
analysis (Martinez-Camara et al. 2013). Because a tweet is much shorter than
a news article, tweet-specific clues were found to be useful for improving sentiment
classification in Twitter, including tweet context (Jiang et al. 2011), emoticons and
hashtags (Purver and Battersby 2012), lengthened words (Brody and Diakopoulos 2011),
and emoji (Felbo et al. 2017).


⁴ http://emm.newsbrief.eu/overview.html.
⁵ http://twitter.com.




On the other hand, deep NLP research such as Stanford Sentiment Treebank 
(Socher et al. 2013)⁶ has become mainstream from a technological point of view. In
this research, the learning model builds up a representation of whole sentences based 
on the sentence structure. An opinion corpus called the Stanford Sentiment Treebank 
has been developed to estimate compositionality in the sentiment detection task. 
It includes the fine-grained sentiment labels “very negative”, “negative”, “neutral”, 
“positive”, and “very positive” for 215,154 phrases in trees parsed with the Stanford 
Parser from 11,855 sentences extracted from movie reviews (Pang and Lee 2005). 

In SemEval 2018 (Mohammad et al. 2018), an opinion corpus has been created 
from 10,983 English, 4,381 Arabic, and 7,094 Spanish tweets, and used to evaluate
the systems. Several tasks are defined that provide annotations for the mental state 
of the tweeter, including (1) the intensities of the four basic emotions (anger, fear, 
joy, and sadness), (2) the intensity of sentiment/valence (very negative, moderately 
negative, slightly negative, neutral or mixed, slightly positive, moderately positive, 
and very positive), and (3) multi-label emotion classification across 12 emotions 
(anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, 
trust, and neutral). The corpus used best-worst scaling (Louviere et al. 2015), a
comparative annotation method in which assessors are given n items (typically
n = 4) and asked to choose the best (highest in terms of the property) and the
worst (lowest in terms of the property). Real-valued scores for the association
between the items and the property were determined based on the number of times
an item was chosen as the best and the worst. The median number of assessors for
each tweet was seven. The inter-annotator agreements (Fleiss' κ) for the multi-label
emotion classification were 0.21, 0.29, and 0.28 for the 12 classes, and 0.40, 0.48,
and 0.45 for the four basic emotions in English, Arabic, and Spanish, respectively.
Most of the participants employed SVM/SVR, LSTMs, and Bi-LSTMs as machine learning
algorithms, and used word embeddings, affect lexicon features, and word n-grams as
features.
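
The scoring step of best-worst scaling can be sketched as follows. The text above only states that scores are based on how often an item was chosen as best and as worst; the specific formula used here (fraction of times chosen best minus fraction of times chosen worst) is a common counting procedure and is an assumption, as are the item names and judgments.

```python
from collections import Counter


def bws_scores(judgments):
    """Score items from best-worst judgments.

    judgments: list of (items, best, worst), where items is the tuple shown
    to the assessor and best/worst are the two chosen items.
    Assumed scoring: (#best - #worst) / #appearances, which lies in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in judgments:
        seen.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Three invented judgments over the same 4-tuple of items.
items = ("a", "b", "c", "d")
judgments = [
    (items, "a", "d"),
    (items, "a", "c"),
    (items, "b", "d"),
]

scores = bws_scores(judgments)
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, round(scores[item], 3))
```

In practice each item appears in several different tuples, so the counts are accumulated over all tuples in which the item was shown.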

Although the document genres being focused on and the annotation properties 
have changed over time, cross-lingual opinion corpora remain important in current 
research. 


6.4 Conclusion 


In this paper, we have discussed the contributions made by our development of NTCIR 
MOAT. We created a cross-lingual opinion corpus using the news document genre, 
following which, several researchers have conducted cross-lingual opinion research 
using our test collections. Although sentiment classification accuracy is improved
by using a cross-lingual corpus, two lines of research remain to be undertaken:
investigating the linguistic properties of opinions in languages rooted in different
cultures, and identifying the opinion retrieval strategies best suited to different
language characteristics.


⁶ https://nlp.stanford.edu/sentiment/.
In recent research, high-quality contextual representations based on neural architectures
such as ELMo (Peters et al. 2018a) and BERT (Devlin et al. 2019) are proving
to be effective in NLP research. In addition, linguistic properties such as morphology,
local syntax, and longer-range semantics tend to be captured at different layers
of these models, namely the word-embedding layer, the lower contextual layers, and
the upper layers, respectively (Peters et al. 2018b; Jawahar et al. 2019). As an extension of bilingual
sentiment word-embedding frameworks (Zhou et al. 2015), cross-lingual sentiment 
retrieval research that considers syntax and semantics in different languages will be 
an interesting direction for future work. 


Acknowledgements This work was partially supported by JSPS Grants-in-Aid for Scientific 
Research (B) (#19H04420). 
Isao Goto


Abstract The NTCIR patent translation task was the first task for the machine trans- 
lation of patents that used large-scale patent parallel sentence pairs. In this chapter, 
we first present the history of machine translation; the contribution of evaluation 
workshops to machine translation research, and previous evaluation workshops; and 
the challenge of patent translation at the time of the first patent translation task at 
NTCIR. We then describe the innovations at NTCIR, including the sharing of research 
infrastructure, the progress of corpus-based machine translation technologies, and 
evaluation methods for patent translation. Finally, we outline the developments in 
machine translation technologies, including patent translation and remark on the 
future of patent translation. 


7.1 Introduction 


Research on machine translation began in the 1950s immediately after the birth 
of computers. The first machine translation technology was Rule-Based Machine 
Translation (RBMT), which used manually built translation rules. RBMT was actively 
developed from the 1970s to the 1980s. In the late 1980s, research began on Statistical
Machine Translation (SMT), which is a learning-based machine translation technology
based on corpus statistics (Brown et al. 1993). However, there was little research
on SMT for about 10 years. Then the situation changed. From the late 1990s to
around 2000, high-performance computers came into widespread use, large parallel
corpora became available, and automatic evaluation methods such as BLEU
(Papineni et al. 2002) were developed. As a result, research on SMT began to
progress rapidly.

The progress of the research was facilitated by evaluation workshops. Evaluation 
workshops played a dual role in providing large datasets and making evaluations 
comparable using shared tasks. This made it possible to conduct experiments by 
sharing research infrastructure and to verify the effectiveness of methods by performing
comparisons using the same data. Evaluation workshops made research
more active, and research on machine translation progressed. The following is a list 
of major evaluation workshops on machine translation that were in existence by the 
mid-2000s: 


• Defense Advanced Research Projects Agency (DARPA) Translingual Information
Detection, Extraction, and Summarization (TIDES) project (2001 to 2005): The
translation directions were Chinese to English and Arabic to English, and the target
domain was news. This project was succeeded by the DARPA Global Autonomous
Language Exploitation (GALE) project.

• International Workshop on Spoken Language Translation (IWSLT) (2004 to
present (2019)¹): From 2004 to 2007, speech translation of travel conversations
was targeted. Several languages were covered, including Japanese and English.
The size of the training parallel corpus was 20,000 to 40,000 sentence pairs.

• Workshop on Statistical Machine Translation (WMT) (2006 to present (2019); see
footnote 1): Machine translation between European languages is the target. From
2006 to 2007, the proceedings of the European Parliament and news were the
target domains.


As of 2007, research on SMT was in progress for several language pairs and fields. For 
the Japanese—English language pair, the domain covered in the evaluation workshops 
was travel conversations only. Because the sentence lengths were short and the topic 
was narrow, the shared task for travel conversation translation was technically easy. 
By contrast, there was no shared task for long sentence translation between Japanese 
and English, which is useful for advancing translation technology for long sentences 
between languages that differ significantly in word order. 

As a domain that includes long sentence translation between Japanese and English,
patent translation is in substantial demand, for example, for filing foreign applications
and for understanding the content of existing patents written in foreign languages.
The machine translation of patents is needed by the sectors that produce
and use intellectual property in many countries and companies. Therefore, if machine
translation performs well for patent translation, there will be a substantial impact on
society.

In 2007, RBMT systems were on the market for the machine translation of patents 
between Japanese and English. Through years of research and development, RBMT 
systems had achieved translation quality at a level that is useful as a rough translation
for manual post-editing.² However, there was a barrier to further improving the
translation quality of RBMT. Simply increasing the number of translation rules did 
not improve translation quality. Manually adding translation rules so that the appro- 
priate translation rules can be selected in accordance with the context from many 


¹ This chapter was written in 2019. Thus, this year does not indicate the final year.

² In fact, when the organizers of NTCIR-7 asked Japanese–English translators to produce
multi-reference translations of the test sentences, the organizers found that an RBMT system was used for
rough translation, and the reference translations had to be retranslated to avoid the bias of a specific
machine translation (MT) system (Fujii et al. 2008).
candidates has been a serious challenge that requires craftsmanship. It was also a
serious challenge to make sentences generated by combining translation rules into 
natural sentences as written by a person. Moreover, both the accumulated amount of 
bilingual patent data and computational power could be expected to increase over 
time. Thus, to overcome the barriers to RBMT and aim for translation quality at 
the level of human translation, corpus-based machine translation technology, which 
automatically acquires translation knowledge and sentence generation knowledge 
from patent data, was required. However, before 2007, there were few studies on 
corpus-based machine translation for the patent field. 


7.2 Innovations at NTCIR 


As explained in the previous section, 2007 was an appropriate time to launch
shared tasks of patent translation between Japanese and English in order to advance
long sentence translation technology between languages that differ greatly in word order. At that
time, the NTCIR-7 organizers extracted over one million Japanese—English parallel 
sentence pairs from parallel patent applications and launched the shared task of patent 
translation. This led to research on corpus-based machine translation for long patent 
sentences between Japanese and English. Patent translation tasks were conducted 
four times, from NTCIR-7 to NTCIR-10, over six years (Fujii et al. 2008, 2010; 
Goto et al. 2011, 2013). In NTCIR-9, the Chinese-English patent translation task 
was added. 

In the following, we present a summary of the comparison between SMT and 
RBMT for patent translation. 


• From the evaluation results of NTCIR-7 in 2008, the translation quality of RBMT
was higher than that of SMT for Japanese–English and English–Japanese translation.

• From the evaluation results of NTCIR-9 in 2011, the translation quality of SMT
for English–Japanese caught up with that of RBMT. The translation quality of
SMT for Chinese–English was higher than that of RBMT because the translation
quality of RBMT was low.

• From the evaluation results of NTCIR-10 in 2013, SMT outperformed RBMT for
English–Japanese translation. Although SMT could not catch up with RBMT for
Japanese–English translation, the top SMT system for Japanese–English translation
at NTCIR-10 improved compared with the top SMT system at NTCIR-9.


Thus, through four rounds of shared tasks over 6 years, the performance of SMT 
substantially improved for patent translation including long sentences for Japanese— 
English and English-Japanese, and Chinese-English. As a result, corpus-based 
machine translation could make it possible to overcome the challenges encountered 
by RBMT. This was the biggest innovation in the patent translation tasks. 

In the following, the purpose of each patent translation task is described, and an 
overview of each of the four tasks, major findings, and innovations is provided. The 
goals of the patent machine translation tasks were as follows: 
• to develop challenging and significant practical research into patent machine translation;

• to investigate the performance of state-of-the-art machine translation in terms of
patent translations involving Japanese, English, and Chinese;

• to compare the effects of different methods of patent translation by applying them
to the same test data;

• to explore practical MT performance in real scenarios for patent machine translation;

• to create publicly available parallel corpora of patent documents and human evaluation
of MT results for patent information processing research;

• to drive machine translation research, which is an important technology for the
cross-lingual access to information written in unfamiliar languages; and

• ultimately, to foster scientific cooperation.


7.2.1 Patent Translation Task at NTCIR-7 (2007-2008) 


As described in Sect. 7.1, 2007 was a time when SMT technology was progressing. 
Because there was an open-source SMT tool called Moses (Koehn et al. 2007) at 
that time, it was easy to conduct experiments on SMT if a bilingual parallel corpus 
was available. SMT could translate short sentences, such as travel conversations, to 
some extent. By contrast, the translation quality of SMT was low for long sentences 
between language pairs with a largely different word order. Therefore, translating a 
patent document that included long sentences between Japanese and English, which 
largely differ in word order, was a serious challenge for SMT. 

In 2007, the organizers constructed a Japanese—English parallel patent dataset 
that consisted of approximately 1.8 million parallel sentence pairs and launched 
the shared tasks of Japanese—English and English-Japanese patent translation. This 
was the first time that more than one million parallel sentence pairs in Japanese 
and English became widely available for research. The task organizers extracted the 
Japanese-English parallel patent sentence pairs from Japanese—English bilingual 
patent families. A patent family is a set of patents filed in more than one country to
protect a single invention. The extraction of parallel sentence pairs was conducted 
by applying an automatic sentence alignment method (Utiyama and Isahara 2007) to 
approximately 85,000 patent families from 10 years of Japanese patents published 
by the Japan Patent Office (JPO) and 10 years of English patents published by the 
United States Patent and Trademark Office. 

In the NTCIR-7 patent translation task, human evaluation was performed. For 
Japanese—English translation, human evaluation was performed for a total of 15 
system outputs that consisted of the 14 system outputs submitted by the participating 
teams and a system output of the SMT tool Moses used by the organizers. The results 
showed that the automatic evaluation BLEU-4 score of SMT was higher than that 
of RBMT; however, in the human evaluation, the results indicated that the actual 
translation quality of RBMT was better than that of SMT. For English-Japanese
translation, human evaluation was performed for some representative systems, and
the results showed that the trend of the comparison between SMT and RBMT was 
similar to that of Japanese—English translation. 

Additionally, the organizers compared the effect when English-Japanese machine 
translation was used for cross-lingual patent retrieval (CLPR) as an extrinsic evalua- 
tion. They used a standard retrieval method for CLPR. Because the standard retrieval 
method did not use the order of words in queries and documents, the order of words 
did not affect the retrieval results. The CLPR results were highly correlated with the 
BLEU score, and SMT was better than RBMT; that is, the results showed that SMT 
was more effective than RBMT in terms of translation word selection. 
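
The point that word order cannot influence such a retrieval method can be illustrated with a minimal bag-of-words sketch. The scoring function below (plain term-overlap counting) and the example sentences are assumptions chosen for brevity; the chapter does not specify which retrieval model the organizers used.

```python
from collections import Counter


def bow_overlap(query: str, document: str) -> int:
    """Count terms shared by query and document, ignoring word order."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(document.lower().split())
    # Counter intersection keeps the minimum count of each shared term.
    return sum((q_terms & d_terms).values())


document = "the valve controls fuel flow to the engine"
well_ordered = "valve controls fuel flow"    # plausible word order
badly_ordered = "flow fuel controls valve"   # same words, scrambled

# Both queries get the same score because only term counts are compared,
# so a translation with poor reordering but good word choice is not penalized.
print(bow_overlap(well_ordered, document), bow_overlap(badly_ordered, document))
```

Under this kind of scoring, a system is rewarded only for selecting the right translation words, which is consistent with the interpretation given above.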


7.2.2 Patent Translation Task at NTCIR-8 (2009-2010) 


The Japanese—English and English-Japanese patent translation tasks continued. The 
organizers expanded the size of the bilingual corpus by extracting parallel sentence 
pairs from 15 years of patent families, and provided the task participants with a 
Japanese—English parallel corpus that consisted of approximately 3.2 million sen- 
tence pairs. In the tasks, no purely RBMT system was included in the evaluation 
and no human evaluation was performed. Therefore, SMT and RBMT could not be 
compared. 

The system with the highest BLEU score for Japanese—English translation first 
translated Japanese sentences into English using RBMT, and then post-edited the 
translation results using SMT (Ehara 2010). The results showed that the word reorder- 
ing performance of SMT had not caught up with that of RBMT. Additionally, the 
shared task of the automatic evaluation of machine translation was also conducted 
using the human evaluation results of NTCIR-7. The task evaluated automatic eval- 
uation methods based on the human evaluation results. 


7.2.3 Patent Translation Task at NTCIR-9 (2010-2011)


The organizers³ added a Chinese-English patent translation task in addition to the
Japanese—English and English-Japanese patent translation tasks. Chinese-English 
translation is a globally required language pair and is popular in the machine trans- 
lation research community. For the Japanese—English and English-Japanese trans- 
lation tasks, the training dataset was the same as that of NTCIR-8, that is, approxi- 
mately 3.2 million sentence pairs, and the test dataset was newly produced. For the 
Chinese-English translation task, the organizers provided the task participants with 
a training dataset that consisted of one million parallel sentence pairs of
Chinese-English bilingual patents. The organizers produced translation results using
commercial RBMT systems to compare SMT and RBMT. They also performed human
evaluation. Twenty-one teams around the world participated in the patent translation
tasks. The introduction of the Chinese-English translation task led to the
participation of top international teams, such as BBN (Ma and Matsoukas 2011), IBM Watson
Research (Lee et al. 2011), and RWTH Aachen University (Feng et al. 2011).


³ The organizers of the patent translation task at NTCIR-9 changed from the organizers at NTCIR-8.

The findings obtained from the evaluation results were as follows: For English— 
Japanese translation, the top SMT system achieved a translation quality equal to or 
better than that of the top RBMT system. For the first time in patent translation from 
English to Japanese, the top SMT system had caught up with the top RBMT system. 
The top SMT system improved substantially in translation quality by improving word 
reordering performance using a pre-ordering method (Sudoh et al. 2011). It became 
clear that separating word reordering from the decoding process could obtain a large 
effect in a simple manner. For Chinese-English translation, the translation quality of 
SMT was higher than that of RBMT because the performance of the Chinese-English 
RBMT systems was low. 

The organizers created and applied a new human evaluation criterion, that is, 
“Acceptability,” in addition to “Adequacy,” which is a conventional human evalu- 
ation criterion. The criteria for each grade of Adequacy were ambiguous, and the 
actual ratings were compared mainly on a relative basis to distinguish between the 
systems to be evaluated. Therefore, the translation quality was not necessarily the 
same for the same grade. For example, grade 3 when only low-level systems were 
evaluated and grade 3 when only high-level systems were evaluated would be differ- 
ent translation qualities. Thus, it was not possible to know the actual quality using 
such relatively scored grades. By contrast, Acceptability was defined as an objective 
and clearer standard, with the aim of making the quality of the same grade constant. 
The Acceptability results showed that the percentage of translated sentences that 
could convey all the meanings of the source sentences was 60% for the top systems 
for both Japanese—English and English-Japanese translation, and the percentage was 
80% for the top system for Chinese-English translation. 


7.2.4 Patent Translation Task at NTCIR-10 (2012-2013) 


The Japanese—English, English-Japanese, and Chinese-English patent translation 
tasks were continued at NTCIR-10. The training dataset was the same as that at 
NTCIR-9 and the test dataset was newly produced. Twenty-one teams participated 
in the tasks. 

The findings obtained from the evaluation results were as follows: For English— 
Japanese translation, the top SMT system (Sudoh et al. 2013) outperformed the 
RBMT systems in terms of translation quality. For Japanese—English translation, 
RBMT was still better than SMT; however, the translation quality of the top SMT 
system had improved from NTCIR-9 (Sudoh et al. 2013). For Chinese-English
translation, the top system used neural networks in a language model to improve
performance (Huang et al. 2013), and the effectiveness of neural networks for machine
translation was thus demonstrated. 

If the test data were simply selected from the automatically extracted parallel
corpus, biases in, for example, sentence length or the expressions included could
result. To reduce biases, the
organizers selected test sentences using two methods. For one method, the organizers 
first calculated the distribution of sentence lengths in monolingual patent documents 
in the source language, and divided the cumulative length distribution into quartiles 
(25% each). Each quartile was called a sentence length class. Next, they classified the 
automatically aligned sentences in the source language into four classes according to 
their sentence lengths and extracted the same number of sentences from each class as 
test sentences. For the other method, the organizers randomly selected test sentences 
from all the description sentences in the source language patents for bilingual patents. 
Translators translated the test sentences to produce their reference translations. The 
data produced by the second method was used for the human evaluation. 
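
The first selection method can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' actual procedure: sentence length is measured in whitespace-separated tokens (Japanese or Chinese text would need a tokenizer or character counts), the cut points use a simple nearest-rank rule, and the function names are hypothetical.

```python
import math
import random
from bisect import bisect_left


def length_cut_points(mono_lengths):
    """Lengths at 25%, 50%, and 75% of the cumulative length distribution."""
    s = sorted(mono_lengths)  # must be non-empty
    return [s[math.ceil(k * len(s) / 4) - 1] for k in (1, 2, 3)]


def length_class(length, cuts):
    """Map a sentence length to one of four classes (0 = shortest quartile)."""
    return bisect_left(cuts, length)


def sample_test_sentences(aligned_sentences, cuts, per_class, seed=0):
    """Draw the same number of source sentences from each length class."""
    rng = random.Random(seed)
    classes = {c: [] for c in range(4)}
    for sentence in aligned_sentences:
        classes[length_class(len(sentence.split()), cuts)].append(sentence)
    picked = []
    for c in range(4):
        # random.sample raises ValueError if a class has too few sentences.
        picked.extend(rng.sample(classes[c], per_class))
    return picked


# Cut points come from monolingual patent text, not from the aligned corpus.
mono_lengths = [5, 8, 12, 15, 20, 26, 33, 47]
cuts = length_cut_points(mono_lengths)  # [8, 15, 26]
aligned = [" ".join(["w"] * n) for n in mono_lengths * 2]
test_set = sample_test_sentences(aligned, cuts, per_class=1)
```

Deriving the cut points from monolingual documents rather than from the aligned corpus is what keeps the length distribution of the test set close to that of real patents.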

At NTCIR-9, the top systems performed well for sentence-level evaluations. 
Therefore, the NTCIR-10 organizers wanted to see how useful the top systems were 
for practical scenarios. Patent examination was one of the practical scenarios. The 
organizers performed Patent Examination Evaluation (PEE), which measures the 
usefulness of MT systems for patent examinations. PEE is described as follows: 
PEE assumes that the patent is examined in English. When a patent application in 
English is filed, an examiner examines existing patents and rejects the patent appli- 
cation if almost identical technology is described in an existing patent. If a patent 
application is rejected by referencing an existing patent, the examiner writes the 
final decision document (Shinketsu), which describes the facts about the existing 
patent on which the rejection is based. Assuming that the referenced patents were 
written in a foreign language, the organizers extracted the part that described the 
facts from the referenced patents and used the extracted sentences as test data. The 
test data in foreign languages (Japanese/Chinese) were translated into English using 
machine translation, and the translation results were evaluated according to whether 
the facts that were used to reject patent applications could be recognized from the 
translation result. PEE was performed by two experienced patent examiners. For 
Japanese—English translation, for the best system, all facts were recognized in 66% 
of referenced patents, and at least half of the facts were recognized in 100% of 
referenced patents. For Chinese-English translation, for the best system, all facts 
were recognized in 20% of referenced patents, and at least half of the facts were 
recognized in 88% of referenced patents. PEE achieved the evaluation of useful- 
ness in one representative practical scenario of patent machine translation. The PEE 
results and translations can be used as standards of usefulness in patent examination. 
Specifically, by comparing new translation results for the PEE test data with the PEE 
evaluated translations at NTCIR-10, their usefulness in patent examination for other 
systems can be assessed roughly.


7.3 Developments After NTCIR-10


The Workshop on Asian Translation (WAT), an evaluation workshop for machine
translation, was launched in 2014. WAT targets machine translation between language pairs that
include Asian languages. The activities of WAT have promoted the construction and 
sharing of research infrastructure for machine translation involving Asian languages. 
WAT features an open innovation platform. The test data and reference translations 
have been published with the training data, and the use of the same test data every 
year facilitates comparisons. In the following, we describe the activities and findings 
of WAT. 

In the first workshop (WAT 2014) (Nakazawa et al. 2014), the organizers set 
the shared tasks of scientific paper translation between Japanese and English, and 
between Japanese and Chinese. An SMT system using syntactic structures achieved 
the highest performance. 

In the second workshop (WAT 2015) (Nakazawa et al. 2015), in addition to the sci- 
entific paper translation tasks, Chinese—Japanese and Korean—Japanese patent trans- 
lation tasks were included. The size of the training dataset for each patent translation 
task was one million sentence pairs. The results showed that the translation quality 
of the top SMT system was higher than that of the RBMT systems for patent trans- 
lation for Chinese—Japanese and Korean—Japanese. For scientific paper translation, 
a reranking method using Neural Machine Translation (NMT) achieved the highest 
translation quality. The effectiveness of the scoring by NMT was thus demonstrated. 

In the third workshop (WAT 2016) (Nakazawa et al. 2016), Japanese—English and 
English-Japanese patent translation tasks were added. The size of the training dataset 
for each patent translation task was one million sentence pairs. For Japanese—English 
patent translation, the results confirmed that the translation quality of NMT and SMT 
outperformed the translation quality of RBMT. This was the first time that a corpus- 
based machine translation system yielded Japanese—English patent translation results 
comparable with those of RBMT systems. The translation quality of NMT evaluated 
by humans was higher than that of SMT for Japanese—English patent translation. 
For Japanese—English and English-Japanese scientific paper translation, pure NMT 
systems, not SMT reranking, achieved the best performance. In areas of machine
translation where large-scale parallel data was available, the mainstream technology
changed from SMT to NMT. For English-Japanese
patent translation, NMT achieved a translation quality close to that of the top SMT. 

In the fourth workshop (WAT 2017) (Nakazawa et al. 2017), news translation 
tasks between Japanese and English and recipe translation tasks between Japanese 
and English were added. In Japanese—English patent translation, the results showed 
that 86% of translated sentences conveyed all the meanings of the source sentences 
for the top NMT system, which was trained using ten million parallel sentence pairs 
in addition to the shared task data of one million parallel sentence pairs. By contrast, 
for Japanese—English news translation, 5% of translated sentences conveyed all the 
meanings for the top NMT system. This percentage is substantially lower than that 
of the top system for Japanese-English patent translation. The small size of the
training data was one of the reasons. An essential reason was that the quality of the
parallel translation of news was lower than that of patents. The reason for the low 
quality of parallel translation of news compared with that of patents is as follows: 
In patent applications, because the content in Japanese is translated literally to make 
an English version of the patent to file as a patent family, the translation quality 
at the sentence level is high. By contrast, news translation is not only translation 
but news writing. In news writing, writers select the content in consideration of the 
difference between readers of news in the source language and readers of news in the 
target language, and writers edit articles to change the structure to that of an English 
news structure. Thus, even if the sentences are aligned in same-topic bilingual news 
articles in Japanese and English, the parallel translation quality at the sentence level 
is lower than that of patents. It was shown that the translation of news with low- 
quality parallel data was a challenge for machine translation. Additionally, in the 
Chinese—Japanese patent translation task, 62% of translated sentences conveyed all 
the meanings of the source sentences. The performance improved from 29% in the 
previous year. Chinese—Japanese patent translation is in high demand in Japan. 

In the fifth workshop (WAT 2018) (Nakazawa et al. 2018), the translation tasks 
between Myanmar and English, and between seven Indic languages and English were 
added. For Japanese—English scientific paper translation, the percentage of translated 
sentences that conveyed all the meanings of the source sentences improved from 34% 
in WAT 2017 to 61% in WAT 2018. 

We have outlined research trends in machine translation, including patent trans- 
lation from the activities of WAT. In the following, we describe other events. Google 
Translate changed from SMT to NMT in 2016. The change to NMT improved the 
translation quality, and people recognized the effectiveness of NMT. As a global 
trend, artificial intelligence (AI) technologies using deep learning have attracted 
attention since 2012. NMT is an AI technology. NMT’s translation quality first 
caught up with SMT’s translation quality in 2014, and NMT’s translation quality 
has improved each year. There were very rapid advances in translation quality in the 
four years from 2015 to 2018. 

Finally, we discuss the future of patent translation. Patent translation is an area in 
which large-scale high-quality parallel corpora are available. For example, a parallel 
corpus exists that contains over 100 million sentences.⁴ Although machine translation
is not perfect, the translation quality of NMT will become close to that of translators
for sentences without low-frequency words or new words as a result of training using 
a parallel corpus with the scale of 100 million sentence pairs. Because patent claims 
in Japanese have special styles, special pre-processing is necessary. The translation 
of sentences in claim sections is expected to be of high quality in the future. How- 
ever, the translation of low-frequency words and new words is a problem that is 
difficult to solve using a corpus-based mechanism alone, and another approach will 
be necessary. Methods that use subword units, such as byte pair encoding (Sennrich 
et al. 2016), alleviate this problem. However, the translation of low-frequency words 
whose elements are not compositional and low-frequency subwords is still a problem.
There have been some studies on using automatically discovered bilingual words,
and such techniques might be applied to NMT. Although machine translation may
make errors, machine translation can do many things. Machine translation can be
used for new translation needs that take advantage of its low cost and high speed. The
patent offices of several countries, such as JPO, have already incorporated machine
translation into their work. Machine translation has also been used in commercial
services that provide foreign language patents in their customers’ preferred language.
Machine translation of patents will be used in society as an indispensable tool to
overcome the language barrier in intellectual property.


⁴ ALAGIN JPO corpus https://alaginre.nict.go.jp/resources/jpo-info/jpo-outline.html.


Chapter 8
Component-Based Evaluation for
Question Answering


Teruko Mitamura and Eric Nyberg 


Abstract This chapter describes the component-based evaluation of automatic question
answering (QA) systems, which was pioneered in the NTCIR-7 ACLIA challenge
and has become a fundamental part of QA system development, especially
for difficult real-world datasets which require a multi-strategy, multi-component 
approach. We summarize the history of component evaluation for QA and describe 
more recent work at Carnegie Mellon (on TREC Genomics, BioASQ, and LiveQA 
datasets) which has descended directly from our experiences in NTCIR. 


8.1 Introduction 


In this chapter, we first describe the component-based evaluations for question 
answering that were developed as part of past NTCIR challenges. We introduce the 
CMU JAVELIN Cross-lingual Question Answering (CLQA) system and show how 
the JAVELIN architecture supports component-level evaluation, which can acceler- 
ate overall system development. This component-based evaluation concept was used 
in the NTCIR-7 ACLIA tasks, not only to evaluate each component but also to eval- 
uate different combinations of Information Retrieval (IR) and Question Answering 
(QA) modules. 

In later sections, we describe more recent developments in component-based 
evaluation within the Open Advancement of Question Answering (OAQA) and Con- 
figuration Space Exploration (CSE) projects. We also describe automatic component 
evaluation for biomedical QA systems. All of these later developments were influ- 
enced by the original vision of component-based evaluation embodied in the NTCIR 
QA tasks. To conclude, we discuss remaining challenges and future directions for 
component-based evaluation in QA.


8.1.1 History of Component-Based Evaluation in QA


The JAVELIN Cross Language Question Answering (CLQA) system, developed by
the Language Technologies Institute (LTI) at Carnegie Mellon University (CMU), had
five main components: question analysis, keyword translation, document retrieval, 
information extraction, and answer generation (Mitamura et al. 2007). This system 
contains an English-to-Japanese QA system and an English-to-Chinese QA system 
with the same overall architecture, which supported direct comparison of the two 
systems on a per-module basis. After analyzing the observed performance of each 
module on the evaluation data, we created gold-standard data (perfect input) for each 
module in order to determine upper bounds on module performance. The overall 
architecture is shown in Fig. 8.1. 

The Question Analysis (QA) module is responsible for parsing the input question, 
choosing the appropriate answer type, and producing a set of keywords. The Transla- 
tion Module (TM) translates the keywords into task-specific languages. The Retrieval 
Strategist (RS) module is responsible for finding relevant documents which might 
contain answers to the question, using translated keywords produced by the Trans- 
lation Module. The Information Extractor (IX) module extracts answers from the 
relevant documents. The Answer Generation (AG) module normalizes the answers 
and ranks them in order of correctness. 

Although traditional QA systems consist of several modules with a cascaded 
approach, as far as we know the JAVELIN CLQA system was the first one to incor- 
porate component-based evaluation for QA. We participated in the NTCIR-5 CLQA1 
task and demonstrated our results (Lin et al. 2005). A more detailed analysis of our 
component-based evaluation was presented at LREC 2006 (Shima et al. 2006). 


[Figure 8.1: block diagram with components including Question Analyzer, Retrieval
Strategist, Translation Module, and Chinese Corpus]

Fig. 8.1 JAVELIN architecture


8.1.2 Contributions of NTCIR


NTCIR first included a question answering challenge (QAC) evaluation for Japanese 
in 2002 (NTCIR-3). The NTCIR-4 and the NTCIR-5 challenges continued to include
QAC tasks in 2004 and 2005 respectively. The NTCIR-5 challenge also added the first 
cross-lingual QA task, which contained five subtasks for three languages: English, 
Japanese, and Chinese. The JAVELIN system was evaluated on the CLQA tasks for 
all three languages. When developing cross-lingual capabilities with three languages, 
system and component development became more complicated, and error analysis 
became very challenging. Therefore, we developed a component-based evaluation 
approach for error analysis and improvement of the JAVELIN CLQA system (Lin 
et al. 2005; Shima et al. 2006). 

Input questions in English are processed by these modules in the order listed above. 
The answer candidates are returned in one of the two target languages (Japanese 
and Chinese) as final outputs. The QA module is responsible for parsing the input 
question, choosing the expected answer type, and producing a set of keywords. The 
QA module calls the Translation Module, which translates the keywords into the 
language(s) required by the task. 

In order to gain different perspectives on the tasks and our system’s performance, 
a module-by-module analysis was performed. We used the formal run dataset from 
NTCIR task CLQA1, which includes English-Chinese (EC) and English-Japanese
(EJ) subtasks. 200 input questions were provided for each of the subtasks. This 
analysis was based on gold-standard answer data, which also provides information 
about the documents that contain the correct answer for each question. We judged 
the QA module by the accuracy of its answer type classification, and the Translation 
Module by the accuracy of its keyword translation. For the RS and IX modules, 
if a correct document or answer is returned, regardless of its ranking, we consider 
the module to be successful. To separate the effects of errors introduced by earlier 
modules, we created gold-standard data by manually correcting answer type and 
keyword translation errors. We also created “perfect” IX input using the gold-standard
document set. In Table 8.1, the overall performance (top 1 average accuracy) is shown 
in the last two columns of the top rows for EC and EJ. The symbol “R” indicates recall 
versus the standard gold answer set; the symbol “R+U” indicates recall versus the 
standard gold answer set plus other (unofficial) correct answers (“Unsupported”). 
If we examine only such global measures, we will not be able to understand the 
performance of individual modules in a complex system. 
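
The idea of isolating a module by feeding it gold-standard input can be sketched in a few lines. This is only an illustration under stated assumptions: the module names mirror the abbreviations used in this chapter, but the toy modules, the data layout, and the function names are hypothetical and do not reproduce the JAVELIN implementation.

```python
def run_pipeline(question, modules, gold, use_gold=()):
    """Run modules in order; for names in use_gold, substitute gold output."""
    state = {"question": question}
    for name, module in modules:
        if name in use_gold:
            state[name] = gold[name]      # "perfect" input for later modules
        else:
            state[name] = module(state)   # actual module output
    return state


def top1_accuracy(questions, modules, gold, use_gold=()):
    """Fraction of questions whose final answer matches the gold answer."""
    correct = 0
    for q in questions:
        state = run_pipeline(q, modules, gold[q], use_gold)
        correct += state["ix"] == gold[q]["ix"]
    return correct / len(questions)


# Toy modules: keyword translation (tm) is deliberately faulty here.
modules = [
    ("atype", lambda s: "PERSON"),
    ("tm", lambda s: ["wrong-translation"]),
    ("rs", lambda s: ["doc1"] if "keyword-ja" in s["tm"] else []),
    ("ix", lambda s: "answer" if "doc1" in s["rs"] else None),
]
gold = {"q1": {"atype": "PERSON", "tm": ["keyword-ja"],
               "rs": ["doc1"], "ix": "answer"}}

baseline = top1_accuracy(["q1"], modules, gold)                    # no gold input
with_gold_tm = top1_accuracy(["q1"], modules, gold, use_gold=("tm",))
```

Comparing the two scores attributes the end-to-end failure to the translation step rather than to retrieval or extraction, which is the kind of diagnosis that a single global measure cannot provide.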

Our analysis of per-module performance from gold-standard input shows that the 
QA module and the RS module are already performing fairly well, but there is still 
room in the IX module and the AG module for future improvement.


Table 8.1 Modular performance analysis (Shima et al. 2006)

    Gold-standard   AType     TM        RS      IX       MRR    Top 1  Top 1
    input           accuracy  accuracy  top 15  top 100         R      R+U
                    (%)       (%)       (%)     (%)             (%)    (%)
EC  None            86.5      69.3      30.5    30.0     0.130  ?      9.5
EC  TM              86.5      -         57.5    50.0     0.254  ?      20.0
EC  TM+AType        -         -         57.5    50.5     0.260  ?      20.5
EC  TM+AType+RS     -         -         -       63.0     0.489  ?      43.0
EJ  None            93.5      72.6      44.5    31.5     0.116  ?      12.5
EJ  TM              93.5      -         67.0    41.5     0.154  9.5    15.0
EJ  TM+AType        -         -         68.0    45.0     0.164  10.0   15.5
EJ  TM+AType+RS     -         -         -       51.5     0.381  32.0   32.5


8.2 Component-Based Evaluation in NTCIR 


In 2007, LTI/CMU became an organizer of the Advanced Cross-lingual Information
Access (ACLIA) task for NTCIR-7. In this task, we started the formal component- 
based evaluation for Japanese (JA), Simplified Chinese (CS), Traditional Chinese 
(CT), and English for the first time (Mitamura et al. 2008). There were two major 
tasks: (1) Information Retrieval for Question Answering (IR4QA) and (2) Complex 
Cross-Lingual Question Answering (CCLQA) tasks. Within the CCLQA task, we had 
three subtasks: Question Analysis track, CCLQA Main Track, and IR4QA+CCLQA
collaboration tracks (obligatory track and optional track). The ACLIA task data flow 
is illustrated in Fig. 8.2. 

As a central problem in question answering evaluation, the lack of standardiza- 
tion made it difficult to compare systems under a shared condition. In NLP research 
at that time, system design was moving away from monolithic, black-box architec- 
tures and more toward modular, architectural approaches that include an algorithm- 
independent formulation of the system’s data structures and data flows, so that multi- 
ple algorithms implementing a particular function can be evaluated on the same task. 
Therefore, the ACLIA data flow includes a pre-defined schema for representing the 
inputs and outputs of the document retrieval step, as illustrated in Fig. 8.2. This novel 
standardization effort made it possible to evaluate IR4QA (Information Retrieval for
Question Answering) in the context of a closely related QA task. During the evalua- 
tion, the question text and QA system question analysis results were provided as input 
to the IR4QA task, which produced retrieval results that were subsequently fed back 
into the end-to-end QA systems. The modular design and XML interchange format 
supported by the ACLIA architecture made it possible to perform such embedded 
evaluations in a straightforward manner. 

The modular design of this evaluation data flow is motivated by the following 
goals: (a) to make it possible for participants to contribute component algorithms to 
an evaluation, even if they cannot field an end-to-end system; (b) to make it possible 
to conduct evaluations on a per-module basis, in order to target metrics and error
analysis on important bottlenecks in the end-to-end system; and (c) to determine
which combination of algorithms works best by combining the results from various
modules built by different participants.


[Figure 8.2: data-flow diagram of the CCLQA task with the Question Analysis, CCLQA
Main, and IR4QA+CCLQA Collaboration tracks. Solid arrows: reads or writes data.
Dashed arrows: can take output from another system as input.]

Fig. 8.2 Data flow in ACLIA task cluster showing how interchangeable data model made
inter-system and inter-task collaboration possible (Mitamura et al. 2008)


8.2.1 Shared Data Schema and Tracks 


In order to combine a Cross-Lingual Information Retrieval (CLIR) module with a 
cross-lingual Question Answering (CLQA) system for module-based evaluation, we 
defined five types of XML schema to support exchange of results among participants 
and submission of results to be evaluated: 


• Topic format: The organizer distributes topics in this format for formal run input
to IR4QA and CCLQA systems.

• Question Analysis format: CCLQA participants who chose to share Question
Analysis results submit their data in this format. IR4QA participants can accept
task input in this format.

• IR4QA submission format: IR4QA participants submit results in this format.

• CCLQA submission format: CCLQA participants submit results in this format.

• Gold-Standard format: The organizer distributes CCLQA gold-standard data in this
format.


Participants in the ACLIA CCLQA task submitted results for the following four 
tracks: 


• Question Analysis Track: Question Analysis results contain key terms and answer
types extracted from the input question. These data are submitted by CCLQA
participants and released to IR4QA participants.

• CCLQA Main Track: For each topic, a system returned a list of system responses
(i.e., answers to the question), and human assessors evaluated them. Participants
submitted a maximum of three runs for each language pair.

• IR4QA+CCLQA Collaboration Track (obligatory): Using possibly relevant
documents retrieved by the IR4QA participants, a CCLQA system generated QA
results in the same format used in the main track. Since we encouraged participants
to compare multiple IR4QA results, we did not restrict the maximum number of
collaboration runs submitted and used automatic measures to evaluate the results.
In the obligatory collaboration track, only the top 50 documents returned by each
IR4QA system for each question were utilized.

• IR4QA+CCLQA Collaboration Track (optional): This collaboration track was
identical to the obligatory collaboration track, except that participants were able to
use the full list of IR4QA results available for each question (up to 1000 documents
per topic).


8.2.2 Shared Evaluation Metrics and Process 


In order to build an answer key for evaluation, third party assessors created a set 
of weighted nuggets for each topic. A “nugget” is defined as the minimum unit of 
correct information that satisfies the information need. 

In this section, we present the evaluation framework used in ACLIA, which is 
based on weighted nuggets. Both human-in-the-loop evaluation and automatic eval- 
uation were conducted using the same topics and metrics. The primary difference is in 
the step where nuggets in system responses are matched with gold-standard nuggets. 
During human assessment, this step is performed manually by human assessors, 
who judge whether each system response nugget matches a gold-standard nugget. 
In automatic evaluation, this decision is made automatically. In the subsections that
follow, we detail the differences between these two types of evaluation.


8.2.2.1 Human-in-the-loop Evaluation Metrics 


In CCLQA, we evaluate how well a QA system can return answers that satisfy infor- 
mation needs on average, given a set of natural language questions. We adopted the 
nugget pyramid evaluation method (Lin and Demner-Fushman 2006) for evaluating
CCLQA results, which requires only that human assessors make a binary decision 
whether a system response matches a gold-standard “vital” nugget (necessary for 
the answer to be correct) or “ok” nugget (not necessary, but not incorrect). This 
method was used in the TREC 2005 QA track for evaluating definition questions, 
and in the TREC 2006-2007 QA tracks for evaluating “other” questions. We evalu- 
ated each submitted run by calculating the macroaverage F-score over all questions 
in the formal run dataset. 

In the TREC evaluations, a character allowance parameter C is set to 100 non- 
whitespace characters for English (Voorhees 2003). Based on the micro-average 
character length of the nuggets in the formal run dataset, we derived settings of C = 
18 for CS, C = 27 for CT and C = 24 for JA. 

Note that precision is an approximation, imposing a simple length penalty on the 
System Response (SR). This is due to Voorhees’ observation that “nugget precision 
is much more difficult to compute since there is no effective way of enumerating 
all the concepts in a response” (Voorhees 2004). The precision is a length-based 
approximation with a value of 1 as long as the total system response length per 
question is less than the allowance, i.e., C times the number of nuggets defined for 
a topic. If the total length exceeds the allowance, the score is penalized. Therefore, 
although there is no limit on the number of SRs submitted for a question, a long list 
of SRs harms the final F-score. 

The F(β = 3), or simply F3, score emphasizes recall over precision, with
the β value of 3 indicating that recall is weighted three times as much as precision.
Historically, a β of 5 was suggested by a pilot study on definitional QA evaluation
(Voorhees 2003). In the later TREC QA tasks, the value was set to 3.
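
The scoring described above can be sketched in a few lines. This is a minimal illustration, not the official ACLIA scorer: the function name and its arguments are hypothetical, recall is taken as the matched share of the total nugget weight, and the over-length penalty 1 - (length - allowance) / length is assumed to follow the TREC definition, since the text above states only that over-long responses are penalized.

```python
def nugget_f_score(matched_weights, all_weights, response_length,
                   allowance_per_nugget, beta=3.0):
    """Return (recall, precision, F_beta) for a single question.

    matched_weights: weights of the gold-standard nuggets judged as matched
    all_weights: weights of all gold-standard nuggets defined for the topic
    response_length: total length of the system responses (characters)
    allowance_per_nugget: the character allowance C (e.g., 24 for JA)
    """
    # Weighted recall: share of the total nugget weight that was matched.
    total_weight = sum(all_weights)
    recall = sum(matched_weights) / total_weight if total_weight > 0 else 0.0

    # Allowance: C times the number of nuggets defined for the topic.
    allowance = allowance_per_nugget * len(all_weights)

    # Precision is 1 within the allowance; beyond it, a length penalty
    # applies (assumed here to be the TREC-style penalty).
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length

    # F_beta with beta = 3 weights recall more heavily than precision.
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return recall, precision, 0.0
    f_beta = (beta ** 2 + 1) * precision * recall / denominator
    return recall, precision, f_beta


# Hypothetical question: three nuggets, two matched, 90 characters returned.
recall, precision, f3 = nugget_f_score(
    matched_weights=[1.0, 0.5],
    all_weights=[1.0, 0.5, 0.5],
    response_length=90,
    allowance_per_nugget=24,
)
```

The run-level score would then be the macroaverage of the per-question F3 values over all questions in the dataset.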


8.2.2.2 Automatic Evaluation Metrics 


ACLIA also utilized automatic evaluation metrics for evaluating the large number of
IR4QA+CCLQA Collaboration track runs. Automatic evaluation is also useful during
development, where it provides rapid feedback on algorithmic variations under test.
The main goal of research in automatic evaluation is to devise an automatic metric 
for scoring that correlates well with human judgment. The key technical requirement 
for automatic evaluation of complex QA is a real-valued matching function that pro- 
vides a high score to system responses that match a gold-standard answer nugget, 
with a high degree of correlation with human judgments on the same task. 

The simplest nugget matching procedure is exact match of the nugget text within 
the text of the system response. Although exact string match (or matching with simple 
regular expressions) works well for automatic evaluation of factoid QA, this model 
does not work well for complex QA, since nuggets are not exact texts extracted 
from the corpus text; the matching between nuggets and system responses requires a 
degree of understanding that cannot be approximated by a string or regular expression 
match for all acceptable system responses, even for a single corpus. 
Fig. 8.3 Formulas of the binarized metric used for official ACLIA automatic evaluation (Mitamura et al. 2008)

$$\mathrm{BINARIZED} = \sum_{n \in \mathrm{Nuggets}} \max_{s \in \mathrm{SRs}} I_{\theta}(n, s)$$

$$I_{\theta}(n, s) = \begin{cases} 1 & \text{if } \mathrm{NuggetRecall}_{token}(n, s) > \theta \\ 0 & \text{otherwise} \end{cases}$$


For the evaluation of complex questions in the TREC QA track, Lin and Demner- 
Fushman (2006) devised an automatic evaluation metric called POURPRE. Since 
the TREC target language was English, the evaluation procedure simply tokenized 
answer texts into individual words as the smallest units of meaning for token match- 
ing. In contrast, the ACLIA evaluation metric tokenized Japanese and Chinese texts 
into character unigrams. We did not extract word-based unigrams since automatic 
segmentation of CS, CT, and JA texts is non-trivial; these languages lack white space 
and there are no general rules for comprehensive word segmentation. Since a single 
character in these languages can bear a distinct unit of meaning, we chose to segment 
texts into character unigrams, a strategy that has been followed for other NLP tasks 
in Asian languages (e.g., Named Entity Recognition; Asahara and Matsumoto 2003).
One of the disadvantages of POURPRE is that it gives a partial score to a system 
response if it has at least one common token with any one of the nuggets. To avoid 
over-estimating the score via aggregation of many such partial scores, we devised a 
novel metric by mapping the POURPRE soft match score values into binary values 
(see Fig. 8.3). We set the threshold θ to be somewhere in between no match and an
exact match, i.e., 0.5, and we used this BINARIZED metric as our official automatic 
evaluation metric for ACLIA. 
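
The binarized matching step can be illustrated with a short sketch. It rests on assumptions that are not taken from the ACLIA overview: the function names are hypothetical, whitespace is ignored, and the soft match score is computed as set overlap of character unigrams divided by the number of distinct nugget characters, which may differ in detail from the POURPRE counting scheme.

```python
def nugget_recall_token(nugget, response):
    """Soft match score: share of the nugget's character unigrams
    that also occur in the system response (set-based, an assumption)."""
    nugget_chars = {ch for ch in nugget if not ch.isspace()}
    if not nugget_chars:
        return 0.0
    response_chars = {ch for ch in response if not ch.isspace()}
    return len(nugget_chars & response_chars) / len(nugget_chars)


def binarized_matches(nuggets, responses, theta=0.5):
    """For each nugget, return 1 if at least one system response has a
    soft match score above theta, and 0 otherwise."""
    return [
        max(
            (1 if nugget_recall_token(n, s) > theta else 0 for s in responses),
            default=0,
        )
        for n in nuggets
    ]


# Hypothetical nuggets and system responses.
nuggets = ["東京都", "abcd"]
responses = ["東京に行く", "ab"]
matches = binarized_matches(nuggets, responses)
matched_count = sum(matches)
```

In this toy case the first nugget shares two of its three characters with the first response (score above 0.5), while the second nugget reaches exactly 0.5 against the second response and therefore does not count as a match.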

Reliability of Automatic Evaluation: We compared per-run (# of data points 
= # of human evaluated runs for all languages) and per-topic (# of data points = # 
of human evaluated runs for all languages times # of topics) correlation between 
scores from human-in-the-loop evaluation and automatic evaluation. Table 8.2, from
the ACLIA Overview (Mitamura et al. 2008), shows the correlation
between the automatic and human evaluation metrics.

The Pearson measure indicates the correlation between individual scores, while 
the Kendall measure indicates the rank correlation between sets of data points. The 
results show that our novel nugget matching algorithm BINARIZED outperformed 
SOFTMATCH for both correlation measures, and we chose BINARIZED as the 
official automatic evaluation metric for the CCLQA task. 
Table 8.2 Per-run and per-topic correlation between automatic nugget matching and human judgment (Mitamura et al. 2008)

Algorithm    Token   Per-run (N = 40)       Per-topic (N = 40 × 100)
                     Pearson    Kendall     Pearson    Kendall
Exactmatch   Char    0.4490     0.2364      0.5272     0.4054
Softmatch    Char    0.6300     0.3479      0.6383     0.4230
Binarized    Char    0.7382     0.4506      0.6758     0.5228
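
For readers who want to reproduce this kind of comparison on their own runs, the two correlation measures can be computed as sketched below. The score lists are invented toy values, not ACLIA data, and the Kendall variant shown (tau-a, without tie correction) is an assumption, since the variant used in the overview is not stated here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def kendall_tau_a(x, y):
    """Kendall rank correlation (tau-a): concordant minus discordant
    pairs, divided by the total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy per-run scores (hypothetical): human F3 versus automatic F3.
human_scores = [0.10, 0.20, 0.30, 0.40]
automatic_scores = [0.15, 0.10, 0.35, 0.50]
r = pearson(human_scores, automatic_scores)
tau = kendall_tau_a(human_scores, automatic_scores)
```

With these toy values, five of the six run pairs are ordered the same way by both scorers and one pair is swapped, which gives a tau of 4/6.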


8.3 Recent Developments in Component Evaluation 


The introduction of modular QA design and component-based QA evaluation by 
NTCIR had a strong influence on subsequent research in applied QA systems. In 
this section, we summarize key developments in QA research that followed directly 
from our experiences with NTCIR. 


8.3.1 Open Advancement of Question Answering 


Shared modular APIs and common data exchange formats have become fundamental 
requirements for general language processing frameworks like UIMA (Ferrucci et al. 
2009a) and specific language applications (like the Jeopardy! Challenge) (Ferrucci
et al. 2010). In 2009, a group of academic and industry researchers published a tech- 
nical report on the fundamental requirements for the Open Advancement of Question 
Answering (OAQA) (Ferrucci et al. 2009b); chief among these requirements are the 
shared modular design, common data formats, and automatic evaluation metrics first 
introduced by NTCIR: 


To support this vision of shared modules, dataflows, and evaluation measures, an open 
collaboration will include a shared logical architecture—a formal API definition for the 
processing modules in the QA system, and the data objects passed between them. For any 
given configuration of components, standardized metrics can be applied to the outputs of 
each module and the end-to-end system to automatically capture system performance at the 
micro and macro level for each test or evaluation. (Ferrucci et al. 2009b) 


By designing and building a shared infrastructure for system integration and evaluation, we 
can reduce the cost of interoperation and accelerate the pace of innovation. A shared logical 
architecture also reduces the overall cost to deploy distributed parallel computing models to 
reduce research cycle time and improve run-time response. (Ferrucci et al. 2009b) 


A group of eight universities followed these principles in collaborating with IBM 
Research to develop the Watson system for the Jeopardy! challenge (Andrews 2011). 
The Watson system utilized a shared, modular architecture which allowed the explo- 
ration of many different implementations of question-answering components. In 
particular, hundreds of components were evaluated, as part of an answer-scoring
ensemble that was used to select Watson’s final answer for each clue (Ferrucci et al. 
2010). 

Following the success of the Watson system in the Jeopardy! Challenge (where 
the system won a tournament against two human champions, Ken Jennings and Brad 
Rutter), Carnegie Mellon continued to refine the OAQA approach and engaged with 
other industrial sponsors (most notably, Hoffmann-La Roche) to develop open-source
architectures and solutions for question answering (discussed below). 


8.3.2 Configuration Space Exploration (CSE) 


In January of 2012, Carnegie Mellon launched a new project on biomedical question 
answering, with support from Hoffmann-La Roche. Given the goal of building a state-of-the-art
QA system for a current dataset (at that time, the TREC Genomics dataset),
the CMU team chose to survey and evaluate published approaches (at the level of 
architecture and modules) to determine the best baseline solution. This triggered a 
new emphasis on defining and exploring a space of possible end-to-end pipelines 
and module combinations, rather than selecting and optimizing a single architecture 
based on preference, convenience, etc. The Configuration Space Exploration project 
(Garduño et al. 2013) explored the following research questions (taken from Yang
et al. 2013): 


• How can we formally define a configuration space to capture the various ways of
  configuring resources, components, and parameter values to produce a working
  solution? Can we give a formal characterization of the problem of finding an
  optimal configuration from a given configuration space?

• Is it possible to develop task-independent open-source software that can easily
  create a standard task framework and incorporate existing tools and efficiently
  explore a configuration space using distributed computing?

• Given a real-world information processing task, e.g., biomedical question answering,
  and a set of available resources, algorithms, and toolkits, is it possible to write
  a descriptor for the configuration space, and then find an optimal configuration in
  that space using the CSE framework?


The CSE concept of operations is shown in Fig. 8.4. Given a labeled set of input- 
output pairs (the information processing task), the system searches a space of possible 
solutions (algorithms, toolkits, knowledge bases, etc.) using a set of standard bench- 
marks (metrics) to determine which solution(s) have the best performance over all 
the inputs in the task. The goal of CSE is to find an optimal or near-optimal solution 
while exploring (formally evaluating) only a small part of the total configuration
space. 
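
The idea of searching a configuration space under a limited evaluation budget can be illustrated with a toy sketch. It is not the CSE framework or its selection strategy: the function names, the component options, and the scoring function are all hypothetical, and the search shown is a plain budgeted enumeration rather than the dynamic configuration selection described in Yang et al. (2013).

```python
from itertools import product


def explore(space, evaluate, budget):
    """Evaluate at most `budget` configurations from `space` and
    return the best configuration found together with its score."""
    names = list(space)
    best_config, best_score = None, float("-inf")
    for index, values in enumerate(product(*(space[name] for name in names))):
        if index >= budget:
            break  # stop once the evaluation budget is spent
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical configuration space: one option list per pipeline stage.
space = {
    "tokenizer": ["whitespace", "hmm"],
    "kb": ["none", "umls"],
    "reranker": ["none", "proximity"],
}

# Hypothetical benchmark: a made-up baseline plus made-up per-option gains.
GAINS = {"hmm": 0.02, "umls": 0.05, "proximity": 0.03}


def toy_evaluate(config):
    return 0.30 + sum(GAINS.get(option, 0.0) for option in config.values())


best_config, best_score = explore(space, toy_evaluate, budget=8)
```

In a realistic setting, `evaluate` would run the configured pipeline over the labeled inputs and return a benchmark metric such as MAP, and the enumeration order would be replaced by a strategy that prioritizes promising configurations.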

Based on a shared component architecture and implemented in UIMA, the Con- 
figuration Space Exploration (CSE) project was the first to automatically choose an 
optimal configuration from a set of QA modules and associated parameter values, 
given a set of labeled training instances (Garduño et al. 2013). As part of his Ph.D.
thesis at Carnegie Mellon, Zi Yang applied the CSE framework to several biomedi- 
cal information processing problems (Yang 2017). In the following subsection, we 
discuss the main results of component evaluation for biomedical QA systems. 


8.3.3 Component Evaluation for Biomedical QA 


Using the Configuration Space Exploration techniques described in the previous 
subsection (Garduño et al. 2013), a group of researchers at CMU were able to automatically
identify a system configuration which significantly outperformed published
baselines for the TREC Genomics task (Yang et al. 2013). Subsequent work showed 
that it was possible to build high-performance QA systems by applying this opti- 
mization approach to an ensemble of subsystems, for the related set of tasks in the 
BioASQ challenge (Yang et al. 2015). 

Table 8.3 shows a summary of the different components that were evaluated for 
the TREC Genomics task: various tokenizers, part-of-speech taggers, named entity
recognizers, biomedical knowledge bases, retrieval tools, and reranking algorithms. 
As shown in Fig. 8.4, the team evaluated about 2,700 different end-to-end configu- 
rations, executing over 190K test examples in order to select the best-performing 
configuration (Table 8.4). After 24 hours of clock time, the system (running on 30 
compute nodes) was able to find a configuration that significantly outperformed the 
published state of the art on the 2006 TREC Genomics task, achieving a document 
MAP of 0.56 (versus a published best of 0.54) and a passage MAP of 0.18 (versus 
a published best of 0.15). Table 8.5 shows the analogous results for the 2007 TREC 


Table 8.3 Summary of components integrated for TREC Genomics (Yang et al. 2013)

Category               Components
NLP tools              LingPipe HMM-based tokenizer
                       LingPipe HMM-based POS tagger
                       LingPipe HMM-based named entity recognizer
                       Rule-based lexical variant generator
KBs                    UMLS for syn/acronym expansion
                       EntrezGene for syn/acronym expansion
                       MeSH for syn/acronym expansion
Retrieval tools        Indri system
Reranking algorithms   Important sentence identification
                       Term proximity-based ranking
                       Score combination of different retrieval units
                       Overlapping passage resolution
[Figure: Configuration space exploration framework. An information processing task passes through configuration space specification, component characteristic modeling, and dynamic configuration selection to yield an optimal information system; a component pool supplies algorithms, toolkits, knowledge bases, and benchmarks.]

Fig. 8.4 Overview of configuration space exploration framework architecture (Yang et al. 2013)


Table 8.4 Performance of automatically configured components (CSE) versus TREC Genomics 2006 participants (Yang et al. 2013)

                     TREC 2006   CSE
No. components       1,000       12
No. configurations   1,000       32
No. traces           92          2,700
No. executions       1,000       190,680
Capacity (hours)     N/A         24
DocMAP max           0.5439      0.5648
DocMAP median        0.3083      0.4770
DocMAP min           0.0198      0.1087
PsgMAP max           0.1486      0.1773
PsgMAP median        0.0345      0.1603
PsgMAP min           0.0007      0.0311


Table 8.5 Performance of automatically configured components versus TREC Genomics 2007 
participants (Yang et al. 2013) 


TREC 2007 CSE 
DocMAP max 0.3286 0.3144 
DocMAP median 0.1897 0.2480 
DocMAP min 0.0329 0.2067 
PsgMAP max 0.0976 0.0984 
PsgMAP median 0.0565 0.0763 
PsgMAP min 0.0029 0.0412 
Fig. 8.5 Modular architecture and components for BioASQ phase B (Yang et al. 2015)


Genomics Task, where CSE was also able to find a significantly better combination 
of components. 

The positive results from applying CSE to the TREC Genomics tasks were 
extended by applying CSE to a much larger, more complex task with many subtasks:
the BioASQ Challenge (Chandu et al. 2017; Yang et al. 2015, 2016). Using
a shared corpus of biomedical documents (PubMed articles), the BioASQ organiz- 
ers created a set of interrelated tasks for question answering: retrieval of relevant 
medical concepts, articles, snippets and RDF triples, plus generation of both exact 
and “ideal” (summary) answers for each question. Figure 8.5 illustrates the modu- 
lar architecture used to generate exact answers for 2015 BioASQ Phase B (Yang 
et al. 2015). Across the five batch tests in Phase B, the CMU system achieved top 
scores in concept retrieval, snippet retrieval, and exact answer generation. As shown 
in Fig. 8.5, this involved evaluating and optimizing ensembles of language models, 
named entity extractors, concept retrievers, classifiers, candidate answer generators, 
and answer scorers. 
8.4 Remaining Challenges and Future Directions


Much recent work in question-answering has focused on neural models which are
trained on large numbers of question-answer pairs created by human curators (e.g.,
SQuAD (Rajpurkar et al. 2016) and SQuAD 2 (Rajpurkar et al. 2018)). While neural
QA approaches are effective when large numbers of labeled training examples are
available (e.g., more than 100,000 examples), in practice neural approaches are very
sensitive to the distribution of answer texts and corresponding questions that are
created by the human curators. For example, a recent study showed that an advanced
question curation strategy, using the original answer texts from SQuAD, produced
a dataset (ParallelQA) that was much tougher for neural models; models evaluated
on SQuAD and ParallelQA did approximately 20% worse on ParallelQA (Wadhwa
et al. 2018c). In the future, we believe that QA research must focus more energy on
defining effective curation strategies, so that the best components and models may be 
chosen and built into an effective solution using the least amount of labeled data and 
human resources. In preliminary work, we have adopted a comparative evaluation 
framework (Wadhwa et al. 2018a) that allows us to compare the performance of 
different neural QA approaches across datasets, in order to identify the approach 
with the most general capability. 

It is also the case that neural approaches to QA often assume that a single neural 
model or an ensemble of neural models will produce an effective solution. In reality, 
it is difficult for any one model to learn all of the varied ways in which answers 
correspond to questions presented by the user. Due to the high cost of training and 
evaluating neural models, researchers often don’t consider more sophisticated combi- 
nations of models, or ensembles with non-neural components. This movement away 
from the multi-strategy, multi-component approach that reached its zenith in IBM 
Watson is unfortunate, because it has focused the QA field on just a few, artificially 
created datasets that are comparatively easy for neural QA approaches. 

It is ironic that the best-performing automatic QA system in the LiveQA evalua- 
tions (Wang and Nyberg 2015b, 2016, 2017) combined sophisticated neural models 
with an optimized version of the classic BM25 algorithm; neither the neural model 
nor BM25 was competitive by itself, but the combination of these two algorithms 
provided the most effective solution for the Yahoo! Answers data set. While it is 
true that curating datasets which can be solved by neural methods has stimulated the 
development of more capable, sophisticated neural models, neural approaches still 
rely on hundreds of thousands of labeled examples, and do not perform well when 
(a) there is limited training data, (b) there is a large variance in the lengths of the 
question versus answer texts, and (c) there is little lexical overlap between question 
and answer texts (Wadhwa et al. 2018b, c). 
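
To make the idea of combining a lexical ranker with a learned one concrete, the sketch below scores documents with a standard Okapi BM25 formula and interpolates the result with a second score list. It is not the LiveQA system: the function names, the interpolation weight, the IDF smoothing variant, and the neural scores (represented here as plain numbers) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative (one common variant).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def combine(lexical_scores, neural_scores, weight=0.5):
    """Linear interpolation of two score lists (weight is hypothetical)."""
    return [weight * lex + (1 - weight) * neu
            for lex, neu in zip(lexical_scores, neural_scores)]


# Toy candidate answers and made-up neural scores.
docs = [["cat", "sat"], ["dog", "ran"], ["cat", "cat", "ran"]]
lexical = bm25_scores(["cat"], docs)
combined = combine(lexical, [0.2, 0.9, 0.4])
```

In practice the two score lists would need to be normalized to a comparable range before interpolation, and the weight would be tuned on held-out data.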
8.5 Conclusion


As we have discussed in this chapter, the development of common interchange for- 
mats for language processing modules in the JAVELIN project (Lin et al. 2005; Mita- 
mura et al. 2007; Shima et al. 2006) led to the use of common schemas in the NTCIR 
IR4QA embedded task (Mitamura et al. 2008), which we believe is the first example 
of a common QA evaluation using a shared data schema and automatic combination 
runs. Although it is expensive to use human evaluators to judge all possible combi- 
nations of systems, automatic metrics (such as ROUGE) can be used to find novel 
combinations that seem to perform well or better than the state of the art; this subset 
of novel systems can then be evaluated by humans. In the OAQA project (which fol- 
lowed JAVELIN at CMU), development participants began to create gold-standard 
datasets that include expected outputs for all stages in the QA pipeline, not just the 
final answer (Garduño et al. 2013). This allowed precise automatic evaluation and 
more effective error analysis, leading to the development of high-performance QA
systems incorporating hundreds of different strategies in real time (IBM Watson) (Ferrucci
et al. 2010). The OAQA approach was also used to evaluate and optimize several 
multi-strategy QA systems, some of which achieved state-of-the-art performance on 
the TREC Genomics datasets (2006 and 2007) (Yang et al. 2013) and BioASQ tasks 
(2015-2018) (Chandu et al. 2017; Yang et al. 2015, 2016). 

Although academic datasets in the QA field have recently focused on specific parts 
of the QA task (such as answer sentence and answer span selection) (Rajpurkar et al. 
2016, 2018) which can be solved by a single deep learning or neural architecture, 
systems which achieve state-of-the-art performance on messy, real-world datasets 
(such as Jeopardy! or Yahoo! Answers) must employ a multi-strategy approach. For 
example, neural QA components were combined with classic information-theoretic 
algorithms (e.g., BM25) to achieve the best automatic QA system performance on the 
TREC LiveQA task (2015-2017) (Wang and Nyberg 2015a,b, 2016, 2017), which 
was based on a Yahoo! Answers community QA dataset. It is our expectation that 
a path to more general QA performance will be found by upholding the tradition 
of multi-strategy, multi-component evaluations pioneered by NTCIR. In our most 
recent work, we have tried to extend the state of the art in neural QA by performing 
comparative evaluations of different neural QA architectures across QA datasets 
(Wadhwa et al. 2018a), and we expect that future work will also focus on how to 
curate the most challenging (and realistic) datasets for real-world QA tasks (Wadhwa 
et al. 2018c). 
Chapter 9
Temporal Information Access


Masaharu Yoshioka and Hideo Joho 


Abstract This chapter introduces the research background and details of temporal 
information access tasks in the NTCIR. The GeoTime task was the first attempt to 
evaluate temporal information retrieval as an extension of an information-retrieval- 
for-question-answering task. Temporalia was a task to investigate the role of temporal 
factors in a search. 


9.1 Introduction 


Temporal information is important for understanding documents and representing users’
information needs. In early Named Entity Recognition (NER) tasks such
as MUC-6 (Sundheim 1995) and IREX (Sekine and Isahara 2000), date and time were
selected as categories for NER. In information access technology research, there had
been several studies on using such temporal information (e.g., Mani et al. 2004), but 
there have not been many studies on temporal information retrieval (Alonso et al. 
2007). 

Compared to the usage of temporal information, Geographic Information
Retrieval (GIR) had attracted more researchers,
and a series of workshops on GIR was started
in 2004 (Purves and Jones 2004). In this series of workshops, temporal information 
was only discussed as a related topic of the task. 

At NTCIR-8, GeoTime (geographic and temporal information retrieval) tasks 
(Gey et al. 2010) were launched as first attempts to construct a test collection for 
temporal information retrieval. This task was designed as an extension of IR4QA
tasks (Mitamura et al. 2008). There were two types of temporal-related queries. One
query type asked for temporal information (a “when” question), while the other query
type used temporal information as constraints (e.g., the winning team of the Super Bowl in 2002).
Details of the information related to the task are discussed in Sect. 9.2. 

Following the success of GeoTime Tasks in NTCIR-8 and 9, a new task was pro- 
posed to further investigate the role of temporal factors in the search. The task was 
called Temporalia (Temporal Information Access) and was run twice in NTCIR-11 
and 12. One of the important innovations in Temporalia was to provide a test col- 
lection that allowed researchers to examine the performance of time-aware search 
applications using categories such as past, recent, future, and atemporal rather than 
focusing on recency queries. Details of the information related to the task are dis- 
cussed in Sect. 9.3. 


9.2 Temporal Information Retrieval 


There are several IR applications that utilize temporal information; e.g., ad hoc 
retrieval, hit-list clustering based on the temporal aspect, exploratory search, and 
visualization of results based on the temporal relationships (Alonso et al. 2007). 
However, there was no IR evaluation campaign for temporal information retrieval 
except for some discussions related to Geographic Information Retrieval (GIR) 
(Purves and Jones 2004). 

To utilize and incorporate the discussion related to Geographic Information 
Retrieval (GIR), GeoTime (geographic and temporal information retrieval) tasks (Gey 
et al. 2010) were launched at NTCIR-8 as an extension of IR4QA for handling spatial 
and temporal-related queries (Mitamura et al. 2008). 


9.2.1 NTCIR-8 GeoTime Task 


The NTCIR-8 GeoTime Task was designed as an IR4QA task for the geographical 
and temporal-related queries. 

Some of the queries were constructed using information about notable events listed
in Wikipedia,¹ and several queries were derived from the ACLIA collection (Sakai
et al. 2010). This task used the New York Times collection for the English document 
database and the Mainichi Japanese newspaper collection for the Japanese document 
database. 

For the evaluation, because most of the queries have both temporal and spatial 
aspects, the articles that can be used for answering questions for temporal and spatial 
aspects were categorized as “fully relevant” and ones that can answer only one aspect 


¹ For example, notable events in 2002 are listed at https://en.wikipedia.org/wiki/2002.
(temporal or spatial) were categorized as “partially relevant”. The submitted results
were evaluated by the same schemes used for the ACLIA IR4QA collection (Sakai 
et al. 2010). 

The following are examples of the queries. 


e How old was Max Schmeling when he died, and where did he die? 
e When and where did a massive earthquake occur in December 2003? 


The former question asks for temporal information using a “when” question. 
The latter question also has the “when” question style, but it also uses temporal 
information to represent constraints (“in December 2003”). 

There were 14 teams that participated in NTCIR-8 GeoTime (8 and 7 teams submitted
Japanese and English runs, respectively) using various approaches
(Gey et al. 2010). The baseline system utilized ordinary ad hoc IR systems such as 
probabilistic IR with blind relevance feedback. This baseline system worked well 
for the English run but underperformed in the Japanese run. Another approach uti- 
lized a NER system and/or geographic resources to extract named entity information 
including geographic and temporal information from the queries and documents. The 
best performing NTCIR-8 Japanese run was a hybrid approach that combined the 
probabilistic approach and weighted Boolean query formulation based on the NER 
results (Yoshioka 2010). There were approaches that focused on geographic infor- 
mation including the hierarchical relationship among location names (e.g., Tokyo is 
a part of Japan) and the distance between the extracted location of the query and 
document, and there were several discussions about the temporal information. 

Another approach emphasized the style of the query in GeoTime. Because the 
query was provided as a question in IR4QA style, the relevant documents should 
contain the information for its answer. Based on this understanding, one team counted 
the number of temporal or geographic mentions that can be candidates for the answer 
for re-ranking (Kishida 2010). Another approach decomposed the question into one
for geographic information and another for temporal information. After decomposing
the question, the team used a factoid question answering system to determine the answer
and utilized its information for constructing new queries (Mori 2010). However, those
approaches did not perform well for the task. 

From the analysis of the difficult queries based on the evaluation of the submitted 
results, two types of difficult queries were identified. One type is that the system 
tends to misinterpret the constraint of the query. An example of the query is “When 
and where were the 2010 Winter Olympics host city location announced?”. In this 
question, “2010” is used as a part of an event name and not as a constraint specifying
that articles should be selected from those published in 2010 or after. Another type of
difficult query requires a list of events to determine relevant articles. An example of 
this type of query is “When and where were the last three Winter Olympics held?”. 
It is difficult to retrieve relevant articles without generating an event list that satisfies 
the query constraint. Details of the discussion about the difficulties of the problem 
are addressed in Sect. 9.2.3. 
9.2.2 NTCIR-9 GeoTime Task Round 2


A comparison of the English and Japanese runs showed that some queries had
large performance variability for the same topics. Therefore, the news article data
for English runs were expanded to include newspapers from different countries. In
addition to the news articles of the New York Times collection, English versions of
Korea Times (Korea), Mainichi (Japan), and Xinhua (China) were used to construct
a document database. 

There were 12 teams that participated in NTCIR-9 GeoTime (5 and 9 teams submitted
Japanese and English runs, respectively) using various approaches
(Gey et al. 2011). One large difference from the previous GeoTime was the usage 
of external resources such as Yahoo PlaceMaker, Wikipedia, DBpedia, Geonames, 
Google Maps, and the Alexandria Digital Library gazetteer. Most of the teams uti- 
lized such information for improving the retrieval results related to the geographic 
queries. However, the query that required reverse geocoding (finding place names
from latitude/longitude information) was not handled appropriately, except by a
team that manually extracted the related event name using Wikipedia.

The best performing team for both Japanese and English runs used manual query 
expansion with a related event name and/or name of the location using Wikipedia and 
Google Maps (Sato 2011). Because this approach was not automatic, it was difficult 
to compare this result with others. However, this result suggested that the extraction 
of such related event names or locations is crucial for improving the recall of the 
related articles. 


9.2.3 Issues Discussed Related to Temporal IR 


One of the difficult queries in NTCIR-8 GeoTime was “When and where were the 
2010 Winter Olympics host city location announced?”. To discuss the difficulties of 
this query, it was necessary to discuss the types of temporal expression. Alonso et al. 
(2007) proposed the following types of temporal expression. 


1. Explicit. Temporal expression directly describes its information (e.g., September 
11, 2001). 

2. Implicit. There is imprecise temporal information, such as names of holidays or 
events. It is possible to extract temporal information using knowledge about such 
holidays or events (e.g., Labor Day, 2001, can be mapped to September 3, 2001, 
and Vancouver Winter Olympics can be mapped to February 2010). 

3. Relative. Temporal expressions represent temporal entities that refer to other tem- 
poral entities. Temporal information resolution is necessary to extract its tempo- 
ral information (e.g., “yesterday” of the news article published on September 12, 
2001, can be mapped to September 11, 2001). 
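
The resolution of a relative expression can be illustrated with a minimal Python 
sketch. The lookup table `RELATIVE_OFFSETS` and the function `resolve_relative` are 
hypothetical names introduced here for illustration only; a real temporal tagger 
would cover far more expressions than the three listed. 

```python
from datetime import date, timedelta

# Hypothetical lookup table: day offsets for a few relative expressions.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_relative(expression: str, publication_date: date) -> date:
    """Map a relative temporal expression to a calendar date,
    using the publication date of the article as the reference point."""
    offset = RELATIVE_OFFSETS[expression.lower()]
    return publication_date + timedelta(days=offset)


# "yesterday" in an article published on September 12, 2001
resolved = resolve_relative("yesterday", date(2001, 9, 12))
print(resolved.isoformat())  # 2001-09-11
```

Explicit expressions need no such reference date, while implicit expressions would 
additionally require a knowledge source that maps holiday or event names to dates. 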


In the query discussed above, “the 2010 Winter Olympics” is the name of the 
event and can be treated as an implicit temporal expression. However, it is not a 
constraint for selecting relevant articles. It is necessary to have a mechanism to 
select which kinds of temporal expression should be used for constraints to retrieve 
relevant articles. 

Another problem is related to handling the relationship between temporal infor- 
mation and event names that represent imprecise temporal information. An example 
of this difficult query is “When and where were the last three Winter Olympics 
held?”; “the last three” uses relative and imprecise temporal information to select 
relevant event names (three Winter Olympic event names). Because most of the rel- 
evant documents contain such event names but do not have such relative expression, 
it is difficult to retrieve such articles without event names. As we confirmed in the 
case of NTCIR-9 GeoTime, query expansion using such event names significantly 
improves the performance. 


9.3 Temporal Query Analysis and Search Result 
Diversification 


To facilitate research on temporal information access, Temporalia-1 in NTCIR-11 
(Joho et al. 2014) focused on each of the four categories in a structured way, while 
Temporalia-2 in NTCIR-12 (Joho et al. 2016) was designed to encourage researchers 
to explore ways to combine the four categories in a meaningful way. Both were 
designed to address the temporal ambiguity and diversity of the search space. 


9.3.1 NTCIR-11 Temporal Information Access Task 


Temporalia-1 at NTCIR-11 consisted of two subtasks: Temporal query intent clas- 
sification and temporal information retrieval. 


9.3.1.1 Temporal Query Intent Classification 


The Temporal Query Intent Classification (TQIC) subtask asked participants to classify 
a given query into one of the following classes: past, recency, future, and atemporal. 
Example queries and ground truth temporal classes are shown in Table 9.1. The classes 
were defined as follows. 

Past: class characterizing queries about past entities/events whose search results 
are not expected to change much with the passage of time. 

Recency: class characterizing queries about recent entities/events, whose search 
results are expected to be timely and up to date. The information contained in the 
search results usually changes quickly with the passage of time. Note that this type 
of query usually refers to events that happened in the near past or at the present time. 
In contrast, the “past” query category tends to refer to events in a relatively distant 
past. 

Table 9.1 Example queries and ground truth temporal classes for the TQIC subtask (dry run) 

Query class   Query example 
Past          Price hike in Bangladesh 2008 
Past          Who was Martin Luther 
Past          When did the Titanic sink 
Past          Yuri Gagarin cause of death 
Past          History of Coca-Cola 
Recency       Apple stock price 
Recency       Number of millionaires in USA 
Recency       Time in London 
Recency       Trendy plus size clothing 
Recency       Did the Pirates win today 
Future        2013 MLB playoff schedule 
Future        Release date for iOS7 
Future        College baseball regional projections 
Future        Disney prices 2014 
Future        Long-term weather forecast 
Atemporal     Blood pressure monitor 
Atemporal     Distance from earth to sun 
Atemporal     How to start a conversation 
Atemporal     New York Times 
Atemporal     Lose weight quickly 

Future: class characterizing queries about predicted or scheduled events, and the 
search results of which should contain future-related information. 

Atemporal: class characterizing queries without any clear temporal intent (i.e., 
their returned search results are not expected to be related to time and should not 
change much over time). Navigational queries are considered to be atemporal. 

Participants were given a set of query strings and query submission dates and 
were asked to develop a system to classify each of the query strings into one of the 
four above-mentioned temporal classes. As this problem requires different 
kinds of knowledge (e.g., historical information or information on planned events), 
the participants were allowed to use any external resources to complete the TQIC 
subtask as long as the details of external resource usage were described in their 
reports. Each participating team was asked to submit a temporal class (past, recency, 
future, or atemporal) for each one of the queries. The performance of submitted runs 
was measured by the number of queries with correct temporal classes divided by the 
total number of queries. 

Table 9.2 Example topics for the TIR subtask (dry run) 

Title               Girl with the Dragon Tattoo 
Description         I have recently watched a film called Girl with 
                    the Dragon Tattoo and I really liked it. 
                    Therefore, I would like to gather information 
                    about the movie 
Past question       How did the casting of the film develop? 
Recency question    What did the recent reviews say about the film? 
Future question     Is there any plan about a sequel? 
Atemporal question  What are the names of the main actors and 
                    actresses of the film? 
Search date         28 Feb 2013 GMT+0:00 
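
The TQIC evaluation measure described above (correctly classified queries divided by 
the total number of queries) can be sketched in Python as follows. The function name 
`tqic_accuracy` and the dictionary-based run format are assumptions made for this 
illustration; they are not the official submission format. 

```python
def tqic_accuracy(gold: dict, predicted: dict) -> float:
    """Fraction of queries whose predicted temporal class matches the ground truth.

    gold and predicted map a query string to one of:
    "past", "recency", "future", "atemporal".
    A query missing from the predictions counts as incorrect.
    """
    if not gold:
        raise ValueError("ground truth must contain at least one query")
    correct = sum(1 for query, label in gold.items() if predicted.get(query) == label)
    return correct / len(gold)


# Ground truth labels taken from the dry-run examples in Table 9.1
gold = {
    "When did the Titanic sink": "past",
    "Apple stock price": "recency",
    "Release date for iOS7": "future",
    "Blood pressure monitor": "atemporal",
}
# An invented system output with one mistake (recency confused with future)
predicted = dict(gold, **{"Apple stock price": "future"})
print(tqic_accuracy(gold, predicted))  # 0.75
```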


9.3.1.2 Temporal Information Retrieval 


The Temporal Information Retrieval (TIR) subtask asked participants to retrieve a set 
of documents in response to a search topic that incorporates a time factor. In addition 
to a typical search topic description (i.e., title, description, and subtopics), the TIR 
search topic description also contains a query submission date (see Table 9.2). This 
subtask required indexing of the document collection with any standard information 
retrieval toolkit. Participants were asked to submit the top 100 documents for each 
temporal question per topic (e.g., top 100 documents for a past question and another 
100 for a recency question). The retrieval effectiveness was evaluated by the precision 
at 20 for each of the temporal questions. Similar to the TQIC subtask, the results 
section presents an analysis of the performance across temporal questions. 
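
Precision at 20 for one temporal question can be sketched in Python as below. The 
function name `precision_at_k` and the document identifiers are hypothetical; the 
sketch treats relevance as binary and divides by the cutoff even when fewer than 20 
documents are returned, which is one common convention rather than a detail confirmed 
by the task description. 

```python
def precision_at_k(ranked_doc_ids: list, relevant_doc_ids: set, k: int = 20) -> float:
    """Share of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k


# An invented run of 100 documents for a single temporal question
run = [f"doc{i}" for i in range(100)]
# Invented judgments: three relevant documents fall inside the top 20, one outside
relevant = {"doc0", "doc3", "doc19", "doc20"}
score = precision_at_k(run, relevant)
print(round(score, 2))  # 0.15
```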


9.3.2 NTCIR-12 Temporal Information Access Task Round 2 


Temporalia-2 at NTCIR-12 also consisted of two subtasks: temporal intent disam- 
biguation and temporally diversified retrieval. 


9.3.2.1 Temporal Intent Disambiguation 


The Temporal Intent Disambiguation (TID) subtask asked participants to estimate a 
probability distribution of a query over four classes denoting the types of temporal 
intent: past, recency, future, and atemporal. The definitions of the four classes were 
based on TQIC in Temporalia-1. An example of the probability distribution of temporal 
intents is shown in Table 9.3. 

Table 9.3 Example queries for the TID subtask (dry run) with query submission date of May 1, 
2013. Ground truth probability of temporal intents was determined by votes from crowd workers 


Query                      Past    Recency  Future  Atem. 
Australian open            0.091   0.0      0.455   0.455 
Motorcycle accident June   0.7     0.0      0.3     0.0 
NBA finals                 0.1     0.0      0.4     0.5 
NBA playoff schedule       0.0     0.2      0.6     0.2 
Price of oil               0.0     0.9      0.0     0.1 
How to lose weight         0.0     0.1      0.0     0.9 
Time in India              0.0     1.0      0.0     0.0 
History of volleyball      1.0     0.0      0.0     0.0 
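
Because the ground truth distributions were derived from crowd worker votes, they can 
be reproduced by normalizing vote counts. The Python sketch below is an illustration 
under assumptions: the function name `intent_distribution` is hypothetical, and the 
vote counts (1 past, 5 future, 5 atemporal out of 11) are not reported in the source 
but are one combination consistent with the “Australian open” row of Table 9.3. 

```python
from collections import Counter

CLASSES = ("past", "recency", "future", "atemporal")


def intent_distribution(votes: list) -> dict:
    """Turn a list of per-worker class votes into a probability distribution."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    total = len(votes)
    return {label: counts[label] / total for label in CLASSES}


# Assumed votes: 11 workers, chosen to match the "Australian open" row
votes = ["past"] + ["future"] * 5 + ["atemporal"] * 5
dist = intent_distribution(votes)
print({label: round(p, 3) for label, p in dist.items()})
# {'past': 0.091, 'recency': 0.0, 'future': 0.455, 'atemporal': 0.455}
```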


Table 9.4 Example topics for the TDR subtask 

Title               Junk food health effect 
Description         I am concerned about the health effects of junk 
                    food in general. I need to know more about 
                    their ingredients, impact on health, history, 
                    current scientific discoveries, and any 
                    prognoses 
Past question       When did junk foods become popular? 
Recency question    What are the latest studies on the effect of junk 
                    foods on our health? 
Future question     Will junk food continue to be popular in the 
                    future? 
Atemporal question  How junk foods are defined? 
Search date         29 May 2013 GMT+0:00 


9.3.2.2 Temporally Diversified Retrieval 


The Temporally Diversified Retrieval (TDR) subtask required participants to retrieve 
a set of documents relevant to each of four temporal intent classes for a given topic 
description (see Table 9.4). Participants were also asked to return a set of docu- 
ments that is temporally diversified for the same topic. They received a set of topic 
descriptions, query issuing times, and indicative search questions for each temporal 
class (past, recency, future, and atemporal). The objective of the indicative search 
questions was to show one possible subtopic under a particular temporal class. 
Participants were asked to develop systems that can produce a total of five ranked 
result lists per topic (past, recency, future, atemporal, and diversified). 


9.3.3 Implications from Temporalia 


This section discusses the implications of the Temporalia tasks for system development 
and for test collection construction. 


9.3.3.1 Implications on System Development 


From the meta-analysis of 17 runs submitted to the TQIC subtask, the classification 
of recency queries was found to be the most difficult with 56% accuracy, and past 
queries were the easiest with 73%. Another overall trend was that no single approach 
was effective across the four temporal classes. A confusion matrix showed that: (1) 
atemporal queries are likely to be confused as either recency or past queries (16.7% 
and 9.6%, respectively), (2) past queries are likely to be confused as atemporal 
queries (13.1%), (3) recency queries are likely to be confused as future or atemporal 
queries (28.2% and 13.5%, respectively), and (4) future queries tend to be confused 
as recency queries (25.9%). Correlation analysis suggested that it was difficult to 
apply the same technique to predict recency queries and atemporal queries with high 
accuracy. 

The TIR subtask showed a similar pattern with varied performance across the four 
classes. No single system was able to perform the best for all classes. The learning-to- 
rank approach was effective for atemporal and past queries, while BM25 performed 
well for recency and future topics. 

The meta-analysis of 37 runs submitted to the TID subtask suggested that when 
a query was temporally ambiguous and multiple temporal classes can be inferred, 
detecting atemporal features was the most difficult. Also, some techniques were 
good at modeling temporally less diverse queries (i.e., a fewer number of nonzero 
probability classes), while other methods were good at modeling temporally more 
diverse queries. 

The results of the TDR subtask suggested that a learning-to-rank approach was 
effective in retrieving relevant documents for all classes compared to BM25. How- 
ever, the best performance on temporal search result diversification was obtained by 
a round-robin of BM25 rankings of four temporal classes, suggesting that there is 
still room for improvement in this area. 
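
The round-robin baseline mentioned above can be sketched in Python as follows. This is 
a minimal illustration, not the participants' implementation: the function name 
`round_robin`, the document identifiers, and the choice to skip duplicates are 
assumptions. 

```python
def round_robin(rankings: list, depth: int) -> list:
    """Interleave several ranked lists, taking one document from each list in turn
    and skipping documents that have already been selected."""
    merged = []
    if depth <= 0:
        return merged
    seen = set()
    longest = max((len(ranking) for ranking in rankings), default=0)
    for rank in range(longest):
        for ranking in rankings:
            if rank >= len(ranking):
                continue  # this list is exhausted
            doc_id = ranking[rank]
            if doc_id in seen:
                continue  # already contributed by another temporal class
            seen.add(doc_id)
            merged.append(doc_id)
            if len(merged) == depth:
                return merged
    return merged


# Invented per-class rankings in the order past, recency, future, atemporal
past, recency, future, atemporal = ["a", "b"], ["c", "a"], ["d"], ["b", "e"]
print(round_robin([past, recency, future, atemporal], depth=5))
# ['a', 'c', 'd', 'b', 'e']
```

The merged list gives each temporal class an equal share of the top positions without 
using any estimate of how likely each intent is for the topic. 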


9.3.3.2 Implications for Test Collection 


Document Collections 


One of the challenges in building a test collection for temporal-aware technolo- 
gies was to obtain access to document collections that have rich temporal features. 
Temporalia was fortunate to have support to use the “LivingKnowledge news and 
blogs annotated sub-collection” constructed by the LivingKnowledge project and 
distributed by the Internet Memory Foundation. The collection was approximately 
20 GB uncompressed and over 5 GB zipped. The collection 
spanned from May 2011 to March 2013 and contained around 3.8 million documents 
collected from about 1,500 different blogs and news sources. The data were split into 
970 files based on the date and sources (there might be more than one file per day). 
Texts in the collection were annotated with entities and with temporal expressions that 
were resolved to a specific day, month, or year (Matthews et al. 2010). Relative 
expressions such as “next month” were resolved based on the publication date of the 
articles. 

In Temporalia-2, we also made efforts to extend the document collections to a second 
target language, Chinese, using SogouCA-2012² and SogouT-2012.³ Similar to the 
English collection, SogouCA-2012 was based on news articles from major publishers 
in China. For annotating temporal expressions, a variant of the standard format 
TIMEX3 used in the TempEval task was applied.⁴ 


Relevance Assessments 


Another challenge we faced during the construction of Temporalia test collections 
was relevance assessment. The temporality of topics and relevance can be subjective 
and not always deterministic. Therefore, we used a mixture of methods to ensure 
that both queries and documents were temporally annotated for evaluation. 

We had a combination of workshops and crowdsourcing in formal runs. In another 
series of workshops, participants (not necessarily the same people as topic creators) 
were asked to read the formal run topic descriptions carefully and assess the relevance 
of the retrieved documents. 

The documents were then evaluated using crowdsourcing for their relevance to 
each of the temporal subclasses. For each assigned subtopic, CrowdFlower workers 
were asked to identify at least one highly relevant and one irrelevant document. They 
were also asked to note the relevant text from original documents in the case of highly 
relevant documents. The relevance of these documents was verified by a third person 
during the workshop to improve their reliability. 

The documents initially identified by the workshop participants were then used 
as “test questions” of crowdsourcing jobs. Test questions were questions that crowd- 
sourcing workers had to pass to participate in our relevance assessment jobs. We 
used CrowdFlower to run relevance judgments. Our configuration of crowdsourcing 
was based on common settings used by various IR evaluations (e.g., Kazai et al. 2013). 


• Each task had five documents to judge 
• Ten cents were paid for one task 
• Each task had 120 s of minimum work time 
• Each document had at least three judgments 
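
The settings listed above imply a simple lower bound on the payment and the worker time 
needed for a given pool size. The Python sketch below only works through that 
arithmetic; the function name `crowdsourcing_budget` and the pool size of 1,000 
documents are hypothetical, and platform fees and test questions are deliberately 
ignored. 

```python
import math


def crowdsourcing_budget(num_documents: int,
                         docs_per_task: int = 5,
                         cents_per_task: int = 10,
                         judgments_per_doc: int = 3,
                         min_seconds_per_task: int = 120) -> dict:
    """Lower bound on tasks, payment, and worker time for a document pool."""
    total_judgments = num_documents * judgments_per_doc
    tasks = math.ceil(total_judgments / docs_per_task)
    return {
        "tasks": tasks,
        "cost_cents": tasks * cents_per_task,
        "min_worker_seconds": tasks * min_seconds_per_task,
    }


# Hypothetical pool of 1,000 documents
budget = crowdsourcing_budget(1000)
print(budget)
# {'tasks': 600, 'cost_cents': 6000, 'min_worker_seconds': 72000}
```

Under these assumptions each document costs at least six cents to judge, because three 
judgments are collected and one ten-cent task covers five documents. 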


² http://www.sogou.com/labs/dl/ca.html. 
³ http://www.sogou.com/labs/dl/t-e.html. 
⁴ http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf. 

We had several iterations of revising job instructions and relevance criteria before 
running all formal run subtopics. We tested both detailed instructions and simple 
instructions, but we received mixed responses from workers. Also, detailed instruc- 
tions caused the time required for relevance assessment to increase too much. After 
several iterations, we decided to use the following three levels of relevance criteria. 


Not Relevant The web page does not contain any information to answer the search 
question. 

Highly Relevant The web page discusses the answer to the search question exhaus- 
tively. In the case of a multifaceted search question, all or most subthemes or view- 
points are covered. Typical extent: several text paragraphs, at least four sentences 
or facts. 

Relevant The web page contains some information to answer the question, but 
the presentation is not exhaustive. In the case of a multifaceted search question, 
only some of the subthemes or viewpoints are covered. Typical extent: one text 
paragraph, or one to three sentences or facts. 


9.4 Related Work and Broad Impacts 


After temporal information retrieval was introduced as a part of the GeoTime task at 
NTCIR-8, several lines of research emerged as variations of temporal 
information retrieval. Kanhabua et al. (2015) is a comprehensive textbook that 
introduces such research results. Moulahi et al. (2016) also summarize past efforts in 
temporal information retrieval evaluation and discuss future directions. From these 
results, we would like to introduce some research that is highly related to the tasks 
discussed above. 

Strötgen and Gertz (2013) and Daoud and Huang (2013) both proposed proximity 
methods for the Geotemporal Information Retrieval task. In these methods, the proximity 
of the geographic and temporal information is considered for ranking documents in 
addition to a standard information retrieval ranking such as BM25. Another 
interesting example is event-centric search and exploration (Strötgen and Gertz 2012). 
This framework was proposed for analyzing historic documents using geographic 
and temporal constraints constructed from event information. In the discussion of 
GeoTime, there was a consideration of using the name of an event for time con- 
straints. This event-centric approach utilizes these characteristics to find documents 
relevant to the event for exploration. 

There have been related efforts to construct test collections for Information Access 
technologies with temporal awareness, such as the TREC Temporal Summarization 
Track (2012-2015) (Aslam et al. 2015; Guo et al. 2013) and TREC Knowledge Base 
Acceleration Track (2012-2014) (Frank et al. 2014). The TREC Temporal Summa- 
rization Track had two subtasks: Sequential Update Summarization and Value Track- 
ing. Sequential Update Summarization sought to find timely, sentence-level, reliable, 
relevant, and nonredundant updates about a developing event, while Value Tracking 
aimed at tracking values of event-related attributes that were of high importance to 
the event. TREC Knowledge Base Acceleration Track was a challenge for filtering 
a large stream of text to find documents that can help update knowledge bases like 
Wikipedia, Facebook, or Crunchbase. Both efforts either explicitly or implicitly had 
a focus on recency information about entities. NTCIR Temporalia was, on the other 
hand, designed to facilitate research on diverse temporal attributes in a systematic 
manner. 

There have been several extensions of the original work. For example, Hasanuz- 
zaman et al. (2016) applied temporal query intent classification techniques to stock 
market analysis. Rizzo and Montesi (2017) used the LivingKnowledge collection to 
conduct a temporal analysis of a digital library collection. Finally, Joho et al. (2013, 
2015) used the Temporalia test collection to study temporal information-seeking 
behavior in a controlled user study and a questionnaire-based study. The studies 
identified the difference in resource selection and relevant content types across tem- 
poral attributes of information needs. These are some of the ways in which the test 
collection for temporal information access can have broader impacts than the original 
objectives of the resources. 

See the overview papers (Gey et al. 2010, 2011; Joho et al. 2014, 
2016) for more details of the broader impacts of GeoTime and Temporalia. 


9.5 Conclusion 


We have introduced two tasks related to temporal information access in the NTCIR 
workshop. GeoTime was the first attempt to place more emphasis on temporal search, 
and Temporalia provided a framework to examine the performance of time-aware 
search applications using a test collection. The review of the literature suggests that 
these resources have been useful for researchers to advance temporal information 
access technologies and to better understand temporal information-seeking behavior. 


Collection with Click Streams Used in a 
Shared-Task Evaluation 

and Zhicheng Dou 


Abstract Search logs are very precious for information retrieval studies. In this 
chapter, we will introduce a real Chinese query log dataset, SogouQ, which was 
released by Sogou corporation in 2010 for the NTCIR-9 Intent task. SogouQ 
contains more than 30 million clicks collected in 2008. It is the first large-scale 
query log used in a shared-task evaluation (i.e., the NTCIR tasks). SogouQ has been 
adopted in a number of follow-up evaluation tasks, namely NTCIR-10 Intent-2, 
NTCIR-11 IMine, and NTCIR-12 IMine-2, as well as in several Chinese domestic tasks. 
Moreover, SogouQ has had a broader impact on other research areas, such as natural 
language processing and social science. It has been acquired by more than 200 
institutions. 


10.1 Introduction 


When we were preparing the NTCIR-9 Intent task, which aims to investigate query 
intents and search result diversification (Song et al. 2011), in 2010, Sogou corporation 
generously provided a real Chinese query log to NTCIR participants and 
the wider research community. The data is called SogouQ and contains 30 million 
clicks collected in 2008. It is the first large-scale query log used in a shared-task 
evaluation, such as the NTCIR tasks. 

The NTCIR-9 Intent task attracted 16 teams for the Subtopic Mining subtask and 
8 teams for the Document Ranking subtask. It became the largest track in NTCIR-9, 
partially because participants were interested in SogouQ and in how to use query logs 
for mining intents and diversifying document ranking. Since then, SogouQ has been used 
for the NTCIR-10 Intent-2 task (Sakai et al. 2013), the NTCIR-11 IMine task (Liu et al. 
2014), and the NTCIR-12 IMine-2 task (Yamamoto et al. 2016). The total number of 
participating groups is more than 80. They are from Australia, Canada, China, Germany, 
France, Japan, Korea, Spain, the UK, and the United States. 

Later, SogouQ had an even bigger impact on research. The usage of the SogouQ 
data collection goes beyond research on query intent. SogouQ is also used for 
improving fundamental natural language processing modules, such as named entity 
identification and new word discovery, for user behavior studies, and for sociological 
topics. More than 200 institutes have acquired SogouQ-related datasets from the 
Tsinghua-Sohu Joint Laboratory on Search Technology. We believe that further practical 
impact has occurred but has not been reported. 

The remainder of this chapter is organized as follows: Sect. 10.2 describes the 
details of SogouQ and its related data collections. Section 10.3 briefly describes 
how organizers and participants use SogouQ in the NTCIR tasks. Section 10.4 
reports more research impact beyond the works published in NTCIR proceedings. 
Section 10.5 concludes this chapter. 


10.2 SogouQ and Related Data Collections 


SogouQ was constructed by the Tsinghua-Sohu Joint Lab on Search Technology. It 
is a web query log of the Sogou search engine covering about one month (June 2008). 
About 30 million clicks are included. The compressed SogouQ is about 1.9 
gigabytes in size and is available for download.¹ 

It should be noted that several similar click datasets were also released by several 
organizations for research purpose: 


• AOL Query logs (2006/36M queries/English) include user ids and click data. 
This dataset was released intentionally for research purposes. However, the 
queries were not filtered, which led to much controversy about privacy issues. 

• MSN Query logs (2006/100M queries/English) include session ids and 
click-through information, but not user ids (Craswell et al. 2009). 

• Yandex Query logs (unknown time/210M queries/Russian) include user sessions 
extracted from Yandex logs, with user ids, queries, query terms, URLs, their 
domains, URL rankings, and clicks. However, the user data is fully anonymized.² 


¹ http://www.sogou.com/labs/resource/q.php. 


The data format of SogouQ is as follows: 


[Access time]\t[User ID]\t[Query]\t[Rank of the URL in the returned result]\t[The sequence 
number of user click]\t[URL that user clicked]\n 


Here User ID is automatically assigned according to the cookie information when 
a user accesses the search engine by using the browser. Different queries that are 
input by the same browser correspond to the same user ID. 
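
A record in this format can be parsed with a few lines of Python. The sketch below 
assumes exactly six tab-separated fields per line, as in the format description above; 
the names `ClickRecord` and `parse_sogouq_line` and the sample line are invented for 
illustration, and the character encoding of the released files is not addressed here. 

```python
from typing import NamedTuple


class ClickRecord(NamedTuple):
    access_time: str      # kept as a string; its exact format is not assumed here
    user_id: str          # cookie-based identifier assigned by the search engine
    query: str
    rank: int             # rank of the URL in the returned result
    click_sequence: int   # sequence number of the user click
    url: str


def parse_sogouq_line(line: str) -> ClickRecord:
    """Parse one tab-separated log line into a ClickRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    access_time, user_id, query, rank, click_sequence, url = fields
    return ClickRecord(access_time, user_id, query, int(rank), int(click_sequence), url)


# An invented sample line, not taken from the real log
sample = "00:00:01\tuser123\twindows\t2\t1\thttp://example.com/\n"
record = parse_sogouq_line(sample)
print(record.query, record.rank, record.click_sequence)  # windows 2 1
```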

Compared to other search log data, SogouQ has several advantages. First, User 
ID and access time can provide information on sessions, which is important for 
session-based retrieval or for mining related searches by session. Second, in addition 
to the clicked URL, SogouQ provides the rank of the clicked URL when it was shown to 
the user and the order in which the user clicked URLs for a query. Such information 
is valuable for research on user click modeling. Third, if we have only URLs, the 
content of URLs is difficult to obtain because the web keeps evolving. URLs may 
expire or the content of some URLs may change. Fortunately, Sogou released a 
document collection called SogouT³ in 2010, which was crawled in June 2008. 
Therefore, researchers can obtain the page content from the same period as the log. 
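
The first advantage, deriving sessions from User ID and access time, can be sketched in 
Python as follows. This is an illustration under assumptions: access times are taken to 
be already converted to seconds, the 30-minute inactivity threshold is a common 
heuristic rather than something specified for SogouQ, and the function name 
`split_sessions` is hypothetical. 

```python
from collections import defaultdict


def split_sessions(events: list, max_gap_seconds: int = 1800) -> list:
    """Group (user_id, timestamp_seconds, query) events into sessions.

    A new session starts whenever the same user is inactive for longer
    than max_gap_seconds (an assumed threshold).
    """
    by_user = defaultdict(list)
    for user_id, timestamp, query in events:
        by_user[user_id].append((timestamp, query))

    sessions = []
    for user_id, items in by_user.items():
        items.sort()  # chronological order per user
        current = [items[0]]
        for previous, item in zip(items, items[1:]):
            if item[0] - previous[0] > max_gap_seconds:
                sessions.append((user_id, [q for _, q in current]))
                current = []
            current.append(item)
        sessions.append((user_id, [q for _, q in current]))
    return sessions


# Invented events: user u1 pauses for more than 30 minutes before the third query
events = [
    ("u1", 0, "windows"),
    ("u1", 60, "windows 7"),
    ("u1", 4000, "house windows"),
    ("u2", 10, "flower"),
]
for user_id, queries in split_sessions(events):
    print(user_id, queries)
```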

We thank Sogou corporation and the Tsinghua-Sohu Joint Lab on Search 
Technology. Thanks to their deep understanding of search and their courage, research 
communities can have such valuable data collections. 


10.3 SogouQ and NTCIR Tasks 


The NTCIR-9 Intent task comprises the Subtopic Mining subtask (given a query, 
output a ranked list of possible subtopic strings) and the Document Ranking subtask 
(given a query, output a ranked list of URLs that are selectively diversified). In 
the Subtopic Mining subtask, a subtopic could be a specific interpretation of an 
ambiguous query (e.g., “microsoft windows” or “house windows” in response to 
“windows”) or an aspect of a faceted query (e.g., “windows 7 update” in response 
to “windows 7”). The subtopics collected from participants were pooled, manually 
clustered, and thereby used as a basis for identifying the search intents of the query. 
The probability of each intent given the query was estimated through assessor voting. 
In the Document Ranking subtask, in contrast to traditional relevance assessments 
where the assessors determine the relevance of each pooled document with respect 
to a topic, we required the assessor to provide graded relevance assessments with 
respect to each intent of a given query. Finally, the relevance and diversity of the 
ranked subtopics or documents were evaluated using diversified information retrieval 
metrics (Sakai and Song 2014). 


² https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data. 
³ http://www.sogou.com/labs/resource/t.php. 

SogouQ was used by every participant for mining subtopics for given queries 
or estimating the importance of subtopics according to the number of clicks (Han 
et al. 2011; Wang et al. 2013; Xue et al. 2011; Yu and Ren 2014). The subtopics and 
their importance in turn influence document ranking. Thus, when user queries and 
clicks are introduced to the subtopic pool via SogouQ, our manually labeled intents 
or documents model the information needs of real users more accurately. Such an 
evaluation benchmark helps research on information retrieval in universities or labs 
without commercial search engines as experimental platforms. 
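
One simple way to estimate subtopic importance from click counts is sketched below in 
Python. It is not the method of any particular participant: treating logged queries 
that extend the original query as candidate subtopics is an assumption made for this 
illustration, and the function name `subtopic_importance` and the sample clicks are 
invented. 

```python
from collections import Counter


def subtopic_importance(original_query: str, clicked_queries: list) -> dict:
    """Estimate subtopic importance as each candidate's share of clicks.

    clicked_queries holds one query string per logged click. A candidate
    subtopic is any logged query of the form "<original query> <extra words>".
    """
    prefix = original_query + " "
    counts = Counter(q for q in clicked_queries if q.startswith(prefix))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {query: n / total for query, n in counts.most_common()}


# Invented click data: one entry per click
clicks = (["windows 7"] * 3 + ["windows update"]
          + ["windows"] * 2 + ["house windows"])
print(subtopic_importance("windows", clicks))
# {'windows 7': 0.75, 'windows update': 0.25}
```

Note that the prefix heuristic misses candidates such as “house windows”, which is one 
reason why participants combined logs with other resources and clustering. 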

In the NTCIR-10 Intent-2 task, the organizers provided the following instruction on 
subtopics: 


A subtopic string of a given query is a query that specialises and/or disambiguates the search 
intent of the original query. If a string returned in response to the query does neither, it is 
considered incorrect. 


e.g. original query: “harry potter” (underspecified) subtopic string: “harry potter philosophers 
stone movie” incorrect: “harry potter hp” (does not specialise) 


It is encouraged that participants submit subtopics of the form “<originalquery> 
<additionalstring>” 


Assessors were asked to provide a label for each intent cluster in the form 
“<originalquery> <additionalstring>”. Such a change provides valuable data to better 
understand a query from the perspective of two intent roles, i.e., kernel-object and modifier 
(Ren and Yu 2016; Yu and Ren 2012; Zheng et al. 2018). In contrast to the NTCIR-9 
Intent task where we had up to 24 intents for a single topic, organizers of Intent-2 
decided to select up to 9 intents per topic based on votes because search result diver- 
sification is mainly about diversifying the first search result page, which can only 
accommodate around ten URLs. 

The NTCIR-11 IMine task continued the Subtopic Mining and Document Ranking 
subtasks and started a new subtask called TaskMine, which aims to explore 
methods of automatically finding subtasks of a given task (e.g., for a given task “lose 
weight”, the possible outputs can be “do physical exercise”, “take calories intake”, 
“take diet pills”, etc.). In the Subtopic Mining subtask, participants are expected to 
generate a two-level hierarchy of underlying subtopics by analyzing the provided 
document collection, user behavior data including SogouQ, or other kinds of 
external data sources. For example, given the ambiguous query “windows”, the first-level 
subtopic may be “microsoft windows”, “software on windows platform”, or “house 
windows”. In the category of “microsoft windows”, users may be interested in dif- 
ferent aspects (second-level subtopics), such as “windows 8” and “windows update”. 
The hierarchical structure of subtopics is closely related to the knowledge graph. 
However, the hierarchical subtopics here are used to describe users’ possible infor- 
mation needs instead of the manually created knowledge structure of entity names. 
Organizers encouraged participants not to use the graph directly even when a 
knowledge graph exists for a given query. Therefore, user behavior data, such as SogouQ, 
plays an important role in creating the hierarchy of subtopics, as real user queries 
reflect users’ possible information needs. 

The NTCIR-12 IMine-2 task focuses on vertical intents behind a query as well as its 
topical intents because many commercial Web search engines merge several types 
of search results and generate a SERP (search engine results page) in response to 
a user’s query. For example, the results of the query “flower” may now contain image 
results and encyclopedia results as well as the usual Web search results. We refer to 
such “types” of search results as verticals. Accordingly, the IMine-2 task comprises two 
subtasks: the Query Understanding subtask and the Vertical Incorporating subtask. 
The Query Understanding subtask is a successor to the Subtopic Mining subtask; 
the difference is that participants are asked to identify the relevant verticals for 
each subtopic. For example, for the query “iPhone 6”, a possible result list of the 
Query Understanding subtask is: 


[tid] [subtopic] [vertical] [score] 
IMINE2-E-000 iPhone 6 apple.com Web 0.9 
IMINE2-E-000 iPhone 6 sales News 0.90 
IMINE2-E-000 iPhone 6 photo Image 0.88 
IMINE2-E-000 iPhone 6 review Web 0.78 


The Vertical Incorporating subtask is likewise a successor to the Document Ranking 
subtask. The difference is that the participants should decide whether the result list 
should contain vertical results or not. SogouQ is still a useful resource of user behaviors 
for the Chinese subtasks. Similarly, Yahoo! Japan provided the participants of the Japanese 
subtasks with Web search-related query data, which was generated from the query log of 
Yahoo! Japan Search from July 2009 to June 2013.⁴ 


10.4 Impact of SogouQ 


As of April 30, 2019, we found 82 papers when searching for the keyword “SogouQ” 
in Google Scholar.⁵ Most of them are not published in NTCIR proceedings. 

Some works such as Gu et al. (2016), Han et al. (2011), Ren et al. (2015), Xue 
et al. (2011), Kim and Lee (2015), and Zheng et al. (2015) use SogouQ to mine 
subtopics (Song et al. 2018; Wang et al. 2013; Yu and Ren 2014), or suggestions 
(Li and Wang 2014; Liu et al. 2017; Shu et al. 2013). Some works, like Zheng et al. 
(2018), use SogouQ to better understand a query from the perspective of two intent 
roles, i.e., kernel-object and modifier (Ren and Yu 2016; Yu and Ren 2012). Some 
other works investigate intent shifting (Wang and Chen 2011), query specification 
(Xiangbin et al. 2015), and search task identification (Du et al. 2018). Some works use 
SogouQ for improving some fundamental modules of natural language processing, 
such as unsupervised dependency parsing (Qiao et al. 2016), new word identification 


4http://research.nii.ac.jp/ntcir/news-20150717-ja.html. 
5http://scholar.google.com. 
(Xuewei 2014), and person name recognition (Lv et al. 2013; Wen et al. 2013). 
Moreover, the rich information in SogouQ provides evidence for computing statistics, e.g., 
queries per second (Fang et al. 2017) and sample queries (Liu and Li 2014); for mining 
particular types of queries, e.g., time-sensitive search queries (Pei et al. 2016) and 
health search queries; or for predicting the authority of websites (Yu and Ren 2018). 

Some uses of SogouQ concern broader research topics. Rao et al. (2014) construct 
a query co-occurrence network from SogouQ and compare it with a Named 
Entity Person co-occurrence network and a network based on the co-occurrence of 
words in sentences of news articles. Wang and Pleimling (2017) use it to investigate 
foraging patterns in online searches. The authors analyze three different click-through 
logs and discover an increased efficiency of the search engines. In the language of 
foraging, the newer logs indicate that online searches overwhelmingly yield local 
searches (i.e., on one page of links provided by the search engines), whereas for the 
older logs, the foraging processes are a combination of local searches and relocation 
phases that are power law distributed. It follows that good search engines enable 
users to find the information they are looking for through a local exploration of 
a single page of search results, whereas with poor search engines, users are often 
forced to do a broader exploration of different pages. 

According to statistics from the Tsinghua-Sohu Joint Lab on Search Technology, 
more than 200 institutions have acquired SogouQ-related datasets. We believe that 
SogouQ has had further practical impact that has not been reported. 


10.5 Conclusion 


The problems explored in the NTCIR Intent and IMine tasks require a collection 
of query logs. With the great support of Sogou corporation, SogouQ became 
the first query log used in a shared evaluation. Compared to other query 
logs, SogouQ has richer information on sessions, rankings, and click order, as well as 
the corresponding documents when combined with SogouT. Therefore, SogouQ 
not only supports research on query understanding of intents and verticals, but also 
enables much work on broader research topics in Web search user behavior. More 
than 200 institutes have acquired SogouQ data and are using the query logs for 
various research and applications. 

Because query logs are highly sensitive, it is difficult to obtain more shared query logs. 
Some efforts have been made to simulate click-through data, such as Sogou-QCL (Zheng 
et al. 2018), to enable neural approaches that need larger amounts of data. 
Chapter 11 
Evaluation of Information Access 
with Smartphones 


Makoto P. Kato 


Abstract NTCIR 1CLICK and MobileClick are the earliest attempts toward test- 
collection-based evaluation for information access with smartphones. Those cam- 
paigns aimed to develop an IR system that outputs a short text summary for a given 
query, which is expected to fit a small screen and to satisfy users’ information needs 
without requiring much interaction. The textual output was evaluated on the basis 
of iUnits, pieces of relevant text for a given query, with consideration of users’ 
reading behaviors. This chapter begins with an introduction to NTCIR 1CLICK and 
MobileClick, explains the evaluation methodology and metrics such as S-measure 
and M-measure, and finally discusses the potential impacts of those evaluation cam- 
paigns. 


11.1 Introduction 


In 2015, Google announced that more searches took place on mobile devices than on 
desktop computers in 10 countries including the US and Japan.1 Among diverse types 
of mobile devices, the smartphone has become dominant according to a survey in 
2015.2 Thus, there is no doubt that the smartphone is one of the most important search 
environments for which search engines should be designed, due to its popularity and 
several differences from traditional devices, e.g., desktop computers. 

The search experience difference between desktop computers and smartphones 
mainly comes from the differences in screen size, internet connection, interaction, 
and situation. A relatively small screen size limits the amount of content which the 
users can read at a time. The internet connection is sometimes unstable depending on 


1https://adwords.googleblog.com/2015/05/building-for-next-moment.html. 
2https://www.pewglobal.org/2016/02/22/smartphone-ownership-and-internet-usage-continues-to-climb-in-emerging-economies/. 
where users conduct search. While the keyboard and mouse are typical input devices 
for desktop computers, touch interaction and speech input are often used for smart- 
phones and may not be suitable for inputting or editing many keywords. Search with 
smartphones can sometimes be interrupted by the other activities with which the user 
is engaged simultaneously. To overcome the limitations in search with smartphones, 
research communities have studied new designs of interface, interaction, and search 
algorithms suitable for smartphones (Crestani et al. 2017). 

NTCIR 1CLICK (Kato et al. 2013a; Sakai et al. 2011b) and MobileClick (Kato 
et al. 2014, 2016b) are the earliest attempts toward test-collection-based evaluation 
for information access with smartphones. Those campaigns aimed to develop an IR 
system that outputs a short text summary for a given query, which is expected to 
fit a small screen and to satisfy users’ information needs without requiring much 
interaction. The textual output was evaluated on the basis of pieces of relevant text 
for a given query. The basic task design is similar to query-biased multi-document 
summarization (Carbonell and Goldstein 1998; Tombros and Sanderson 1998), in 
which a system is expected to generate a summary of a fixed length from multiple 
documents, satisfying the information need of users who input a certain query. The 
main difference from the query-biased multi-document summarization task is posi- 
tion awareness of presented information. In the NTCIR 1CLICK and MobileClick 
tasks, more important information is expected to be present at the beginning of 
the summary so that users can reach such information efficiently. In other words, 
more relevant information pieces should be ranked at higher positions like an ad hoc 
retrieval task. Accordingly, evaluation measures used in these tasks were designed 
to be position-aware, unlike those for text summarization such as recall, precision, 
and ROUGE (Lin 2004). This task design and evaluation methodology distinguishes 
NTCIR 1CLICK and MobileClick from the other summarization tasks, and had some 
impact on mobile information access and related fields. 

This chapter first describes the task design of NTCIR 1CLICK and MobileClick, 
introduces evaluation methodologies used in these campaigns, and finally discusses 
potential impacts on works published after NTCIR 1CLICK and MobileClick. 


11.2 NTCIR Tasks for Information Access 
with Smartphones 


This section provides a brief overview of the task design of the NTCIR 1CLICK and 
MobileClick tasks. Table 11.1 summarizes four NTCIR tasks to be described in this 
section.3 


3S#-measure is a combination of S-measure and T-measure (a precision-like metric) (Sakai and 
Kato 2012). 
Table 11.1 NTCIR tasks for information access with smartphones 


Year  NTCIR  Task           Subtasks                               Primary metric 

2010  9      1CLICK-1                                              S-measure 
2011  10     1CLICK-2       Main & Query classification            S#-measure 
2013  11     MobileClick-1  iUnit retrieval & iUnit summarization  M-measure 
2014  12     MobileClick-2  iUnit retrieval & iUnit summarization  M-measure 


11.2.1 NTCIR 1CLICK 


The history of information access with smartphones in NTCIR began from a subtask 
of the NTCIR-9 INTENT task, namely, NTCIR-9 1CLICK-1 (formally, one-click 
access task) (Sakai et al. 2011b). While the NTCIR-9 INTENT task targeted search 
result diversification, the NTCIR-9 1CLICK-1 task focused especially on generating 
a query-biased summary as a proxy for a search engine result page (or “ten blue 
links”), for satisfying the user immediately after the user clicks on the search button. 
Strictly speaking, the NTCIR-9 1CLICK-1 task was designed not for information 
access with smartphones, but for Direct and Immediate Information Access, which 
was defined in earlier work by the task organizers (Sakai et al. 2011a): 


We define Direct Information Access as a type of information access where there is no user 
operation such as clicking or scrolling between the user’s click on the search button and 
the user’s information acquisition; we define Immediate Information Access as a type of 
information access where the user can locate the relevant information within the system 
output very quickly. Hence, a Direct and Immediate Information Access (DIA) system is 
expected to satisfy the user’s information need very quickly with its very first response. 


While the NTCIR-9 1CLICK-1 task was treated as a pilot task and targeted only 
the Japanese language, the 1CLICK-2 task was organized as an independent task 
at NTCIR-10 and employed almost the same task design as that of the NTCIR-9 
1CLICK-1 task, with the scope extended to Japanese and English. 

At both 1CLICK-1 and 1CLICK-2, participants were given a list of queries cat- 
egorized into four query categories, namely, celebrity, local, definition, and Q&A. 
The task organizers selected these categories following the work by Li et al. (2009), 
which investigated Google’s desktop and mobile query logs of three countries, and 
identified frequent query types for good abandonment—an abandoned query for 
which the user’s information need was successfully addressed by the search engine 
result page without clicks or query reformulation. 

NTCIR-9 1CLICK-1 and NTCIR-10 1CLICK-2 participants were expected to 
produce a plain text of X characters for each query (X = 140 for Japanese and X = 
280 for English),4 based on a given document collection. The output was expected 
to include important pieces of information first and to minimize the amount of text 
the user has to read. These requirements are more formally described through the 
evaluation metrics explained in Sect. 11.3. 


Fig. 11.1 A two-layered summary for query “christopher nolan”. Users can see the second 
layer if they click on a link in the first layer 


11.2.2 NTCIR MobileClick 


NTCIR MobileClick, which started at NTCIR-11, inherited the spirit of NTCIR 
1CLICK and aimed to directly return a summary of relevant information and 
immediately satisfy the user without requiring much interaction. Unlike the 1CLICK tasks, 
participants were expected to produce a two-layered summary that consists of a 
single first layer and multiple second layers, as shown in Fig. 11.1. The first layer is 
expected to contain information interesting for most of the users, and the links to 
the second layers; each second layer, which is hidden until its header link is clicked 
on, is expected to contain information relevant for a particular type of user. In a 
two-layered summary, users can avoid reading text in which they are not interested, 
thus saving time spent on non-relevant information, if they can make a binary yes/no 
decision about each second-layer entry from the header link alone. 


4Both NTCIR-9 1CLICK-1 and NTCIR-10 1CLICK-2 accepted two types of runs, namely, DESK- 
TOP and MOBILE runs. In this chapter, only MOBILE runs are explained for simplicity. 
This unique output was motivated by the discussion at the NTCIR-10 conference 
in June 2013, and reflected the rapid growth of smartphone users in those years. 
Although 1CLICK expects no interaction except for clicking on the search button, 
MobileClick targeted smartphone users and expects users to tap on some links for 
browsing desired information efficiently. 

NTCIR MobileClick assumed different types of users who are interested in 
different topics. The diversity of users who input a certain query was modeled by the intent 
probability, i.e., a probability distribution over the intents for the query. For example, among 
users who input “apple” as a query, 90% are interested in Apple Inc. and 10% are 
interested in apple the fruit. A two-layered summary is considered good if different 
types of users are all satisfied with the summary. Thus, the first layer should not 
contain information in which only a particular type of user is interested, while the second 
layers should not contain information relevant to the majority of users. 

The input in the NTCIR-11 MobileClick-1 and NTCIR-12 MobileClick-2 tasks 
was a list of queries that were basically categorized into the four types mentioned earlier. 
There were two subtasks in these evaluation campaigns: the iUnit retrieval and iUnit 
summarization subtasks. In the iUnit retrieval subtask, participants were expected to output 
a ranked list of information pieces called iUnits in response to a given query. In the iUnit 
summarization subtask, as was explained earlier, the output was a two-layered 
summary in XML format. While the NTCIR-11 MobileClick-1 required participants to 
identify information pieces from a document collection, the NTCIR-12 MobileClick-2 
only required selecting and ranking or arranging predefined information pieces, 
mainly for increasing the reusability of the test collection. 


11.3 Evaluation Methodology in NTCIR 1CLICK and 
MobileClick 


This section explains and discusses some details of the evaluation methodology used 
in the NTCIR 1CLICK and MobileClick tasks, which is mainly based on nuggets, 
or pieces of information we call iUnits. We first present the background and explain 
the differences between summarization and our tasks. We then focus on the notions 
of nuggets and iUnits, and finally discuss the effectiveness metrics developed and 
used in the NTCIR tasks. 


11.3.1 Textual Output Evaluation 


Summarization is one of the most similar tasks to NTCIR 1CLICK and MobileClick. 
As mentioned earlier, the most notable difference between the summarization and 
these NTCIR tasks is position awareness of information pieces in the textual out- 
put. This subsection details and discusses the difference in terms of the evaluation 
methodology. 
Automatic evaluation of machine-generated summaries has often been conducted 
by comparison with human-generated summaries (Nenkova and McKeown 2011). 
ROUGE is a widely used evaluation metric based on word matching between a 
machine summary and human summaries (Lin 2004). There are several variants 
of ROUGE such as ROUGE-N (n-gram matching), ROUGE-L (longest common 
subsequence), and ROUGE-S (skip-gram matching). Although these variants are 
sensitive to the order of words, they are agnostic to the absolute position where each word 
appears in a machine summary. The Pyramid method identifies Summary Content 
Units (SCUs), which are word spans expressing the same meaning, from multiple 
human summaries, and computes a score for each machine summary based on the 
included SCUs (Nenkova et al. 2007). The weight of an SCU is determined by the 
number of human summaries including the SCU, and a summary is scored basically 
by the sum of the weights of SCUs within the summary. The position of SCUs within 
a machine summary does not affect the score. 
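
The position insensitivity of this scoring scheme can be made concrete with a small sketch. It assumes each summary has already been annotated as a collection of SCU labels, and it computes only the raw, unnormalized sum of weights described above; the function names are hypothetical.

```python
from collections import Counter


def scu_weights(human_summaries):
    """Weight of each SCU = number of human summaries that contain it."""
    counts = Counter()
    for scus in human_summaries:
        counts.update(set(scus))  # count each SCU at most once per summary
    return counts


def pyramid_raw_score(machine_scus, weights):
    """Sum of the weights of the distinct SCUs found in a machine summary."""
    return sum(weights[scu] for scu in set(machine_scus))


# Three human summaries annotated with (hypothetical) SCU labels.
humans = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
weights = scu_weights(humans)

# The same SCUs in a different order receive the same score.
score_forward = pyramid_raw_score(["a", "b"], weights)
score_reversed = pyramid_raw_score(["b", "a"], weights)
```

Because the score depends only on which SCUs are present, reordering the content of a machine summary cannot change it.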

The insensitivity to the position of information pieces (i.e., words or SCUs) is 
reasonable when it is assumed that the whole summary is always read. In such a case, 
the position of information pieces should not affect the utility of the summary, as all 
the information pieces are equally consumed by the reader. 

On the other hand, the position matters when users may read different parts of a 
summary. As the textual output in NTCIR 1CLICK is expected to be scanned from 
top to bottom, like Web search results, content near the end is less likely to 
be read and, accordingly, should be discounted when the utility is estimated. The 
two-layered summary in NTCIR MobileClick can be read in many different ways. A 
user may read only the first layer, while another user may scan the contents of the first 
layer from the top, click on a link of interest, read the second layer shown 
by the click, and stop reading at the end of the second layer. Therefore, the primary 
difference from ordinary summarization tasks is how the summary is expected to be 
read, which naturally requires different evaluation methodologies. 


11.3.2 From Nuggets to iUnits 


The NTCIR-9 1CLICK-1 task evaluated the system output based on nuggets. Nuggets 
are fragments of text that have frequently been used in summarization and question 
answering evaluation. The TREC Question Answering track defined an information 
nugget as “a fact for which the assessor could make a binary decision as to whether 
a response contained the nugget” (Voorhees 2003). The possibility of the binary 
decision is called atomicity (Dang et al. 2007). As explained earlier, the Pyramid 
method (Nenkova et al. 2007) uses SCUs as units of comparison: 


SCUs are semantically motivated, subsentential units; they are variable in length but not 
bigger than a sentential clause. This variability is intentional since the same information 
may be conveyed in a single word or a longer phrase. SCUs emerge from annotation of a 
collection of human summaries for the same input. They are identified by noting information 
that is repeated across summaries, whether the repetition is as small as a modifier of a noun 
phrase or as large as a clause. 


Babko-Malaya described a systematic way to make the granularity of nuggets uniform 
based on several nuggetization rules (Babko-Malaya 2008). Examples of the rules 
are shown below: 


Nuggets are created out of each core verb and its arguments, where the maximal extent of 
the argument is always selected. 


Noun phrases are not decomposed into separate nuggets, unless they contain temporal, 
locative, numerical information, or titles. 


Basic elements are another attempt to systematically define nuggets (Hovy et al. 
2006), and were defined as follows: 


the head of a major syntactic constituent (noun, verb, adjective or adverbial phrases), 
expressed as a single item, or a relation between a head-BE and a single dependent, expressed 
as a triple (head—modifier—relation). 


Although several attempts had been made to standardize the nuggetization proce- 
dure, the task organizers of NTCIR 1CLICK still found it hard to identify nuggets. 
The primary difficulty is making the granularity of nuggets uniform. While the notion of 
atomicity determines the unit of nuggets to some extent, there were some cases in 
which assessors disagreed. Typical examples are shown below: 


1. Tetsuya Sakai was born in 1988. 
2. Takehiro Yamamoto received a PhD from Kyoto University in 2011. 


The following pieces are candidates for nuggets in sentences 1 and 2. 


1-A. Tetsuya Sakai was born in 1988. 

1-B. Tetsuya Sakai was born. 

2-A. Takehiro Yamamoto received a PhD from Kyoto University in 2011. 
2-B. Takehiro Yamamoto received a PhD in 2011. 

2-C. Takehiro Yamamoto received a PhD from Kyoto University. 

2-D. Takehiro Yamamoto received a PhD. 


Although 1-B and 2-D are results of a similar type of decomposition, 1-B does not 
look appropriate for a nugget, but 2-D does. However, 2-D may not be an appropriate 
nugget if the query is “When did Takehiro Yamamoto receive his PhD?” since 2-D 
is then a trivial fact like 1-B. A systematic approach may not be very helpful in this 
case. 

Another difficulty is the way to determine the weight of nuggets. Unlike the 
Pyramid method and others, the NTCIR-9 1CLICK-1 task extracted nuggets from 
a document collection from which the textual output is generated, not from those 
generated by human assessors. This methodology was chosen because there were 
hundreds of nuggets for some queries, which cannot be included in a few human- 
generated summaries. The weighting schema used in the Pyramid method cannot 
be simply applied to this case, as the number of assessors who found a nugget may 
simply reflect the frequency of the nugget in the collection, but it might be unrelated 
to the importance of the nugget. Furthermore, the dependency of nuggets makes the 
problem more complicated. For example, 2-B entails 2-D. Then, what is the score 
of a summary including 2-B? Is it the sum of the weights of 2-B and 2-D, or 2-B’s 
alone? 

To clarify the definition of nuggets and weighting schema in NTCIR 1CLICK, 
the task organizers of the NTCIR-10 1CLICK-2 opted to redefine nuggets and call 
them information units or iUnits. 

iUnits satisfy three properties, relevant, atomic, and dependent, described in detail 
below. Relevant means that an iUnit provides useful factual information to the user 
on its own. Thus, it does not require other iUnits to be present in order to provide 
useful information. For example: 


1. Tetsuya Sakai was born in 1988. 
2. Tetsuya Sakai was born. 


If the information need is “Who is Tetsuya Sakai?”, (2) alone is probably not useful, 
and therefore this is not an iUnit. Note that this property emphasizes that the infor- 
mation need determines which pieces of information are iUnits. If the information 
need is “Where was Tetsuya Sakai born?”, neither can be an iUnit. 

Atomic means that an iUnit cannot be broken down into multiple iUnits without 
loss of the original semantics. Thus, if it is broken down into several statements, at 
least one of them does not pass the relevance test. For example: 


1. Takehiro Yamamoto received a PhD from Kyoto University in 2011. 
2. Takehiro Yamamoto received a PhD in 2011. 

3. Takehiro Yamamoto received a PhD from Kyoto University. 

4. Takehiro Yamamoto received a PhD. 


(1) can be broken down into (2) and (3), and both (2) and (3) are relevant to the 
information need “Who is Takehiro Yamamoto?”. Thus, (1) cannot be an iUnit, 
but (2) and (3) are iUnits. (2) can be further broken down into (4) and “Takehiro 
Yamamoto received something in 2011”. However, the latter does not convey useful 
information for the information need. The same goes for (3). Therefore, (2) and (3) 
are valid iUnits and (4) is also an iUnit. 

Dependent means that an iUnit can entail other iUnits. For example: 


1. Takehiro Yamamoto received a PhD in 2011. 
2. Takehiro Yamamoto received a PhD. 


(1) entails (2) and they are both iUnits. 

In the NTCIR-10 1CLICK-2, nuggets were first identified from a document 
collection, and iUnits were extracted from the nuggets.5 A set of iUnits for query 
1C2-J-0001 “倉木麻衣 (Mai Kuraki; a Japanese singer-songwriter)” is shown in Table 11.3, 


5This approach was taken mainly for increasing the efficiency by dividing the iUnit extraction task 
into two parts. 
Table 11.2 Nuggets for query 1C2-J-0001 “倉木麻衣 (Mai Kuraki; a Japanese singer-songwriter)” 


ID    Nugget (English gloss) 

S005  She made her American debut with “Baby I Like” as “Mai K” in October 1999, 
      when she was 16 years old. In the same year, on December 8, she made her 
      debut in Japan with “Love, Day After Tomorrow” 
S008  Blood type: B 
S012  Occupation: Singer 
S022  She graduated from Ritsumeikan University in 2005. 
S023  “delicious way” won “Rock album of the Year” and “Secret of my heart” won 
      “Song of the Year” at the 15th annual Japan Gold Disc Awards 


which were extracted from nuggets in Table 11.2. The column “Entails” indicates a 
list of iUnits that are entailed by the iUnit. For example, iUnit I014 entails I013, and 
iUnit I085 entails iUnits I023 and I033. A semantics is the factual statement that the 
iUnit conveys. This is used by assessors to determine whether an iUnit is present in 
a summary. 

A vital string is a minimally adequate natural language expression and extracted 
from iUnits. This approximates the minimal string length required so that the user 
who issued a particular query can read and understand the conveyed information. 
The vital string of iUnit u that entails iUnits e(u) does not include that of iUnits e(u) 
to avoid duplication of vital strings, since if iUnit u is present in a summary, iUnits 
e(u) are also present by definition. For example, the vital string of iUnit I014 does 
not include that of iUnit I013 as shown in Table 11.3. Even the vital string of I085 is 
empty as it entails iUnits I023 and I033. 

Having extracted iUnits from nuggets, assessors gave a weight to each iUnit 
on a five-point scale (very low (1), low (2), medium (3), high (4), and very high (5)). 
iUnits were randomly ordered and their entailment relationship was hidden during the 
voting process. After the voting, we revised each iUnit’s weight so that iUnit u entailing 
iUnits e(u) receives the weight of only u excluding that of e(u). This revision is 
necessary because the presence of iUnit u in a summary entails that of iUnits e(u), 
resulting in double counting of the weight of e(u) when we take into account 
the weight of both u and e(u). 

For example, suppose that there are only four iUnits: 


1. Ichiro was a batting champion (3). 
2. Ichiro was a stolen base champion (3). 
3. Ichiro was a batting and stolen base champion (7). 
4. Ichiro was the first player to be a batting and stolen base champion since Jackie 
Robinson in 1949 (8). 
Table 11.3 iUnits for query 1C2-J-0001 “倉木麻衣 (Mai Kuraki; a Japanese singer-songwriter)” 


ID    Entails     Nugget  Semantics (English gloss)                Vital string (English gloss) 

I011              S005    Made her Japanese debut in 1999          Made her Japanese debut in 1999 
I012              S008    Blood type: B                            Blood type: B 
I013              S022    Graduated from Ritsumeikan University    Graduated from Ritsumeikan University 
I014  I013        S022    Graduated from Ritsumeikan University    2005 
                          in 2005 
I017              S012    Occupation: Singer                       Singer 
I023              S023    Won “Song of the Year” at the 15th       Won “Song of the Year” at the 15th 
                          annual Japan Gold Disc Awards            annual Japan Gold Disc Awards 
I033              S023    Single “Secret of my heart”              Single “Secret of my heart” 
I085  I023, I033  S023    “Secret of my heart” won “Song of the    (empty) 
                          Year” at the 15th annual Japan Gold 
                          Disc Awards 


where (4) entails (3), and (3) entails both (1) and (2). A parenthesized value indicates 
the weight of each iUnit. Suppose that a summary contains (4). In this case, the 
summary also contains (1), (2), and (3) by definition. If we just sum up the weights of 
iUnits in the summary, the result is 21 (= 3 + 3 + 7 + 8), where the weight of (1) and 
(2) is counted three times and that of (3) is counted twice. Therefore, it is necessary 
to subtract the weight of entailed iUnits to avoid the duplication; in this example, 
the weights of the iUnits thus become 3, 3, 4 (= 7 − 3), and 1 (= 8 − 7), respectively. 

More formally, we used the following equation for revising the weight of iUnit 
u: 


$$w(u) - \max_{u' \in e(u)} w(u'), \tag{11.1}$$


where w(u) is the weight of iUnit u. Note that iUnits e(u) in the equation above 
are ones entailed by iUnit u and the entailment is transitive, i.e., if i entails j and j 
entails k, then i entails k. 
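
The revision in Eq. 11.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: iUnits are identified by arbitrary hashable IDs, direct entailments are given as a dictionary, and an iUnit that entails nothing keeps its original weight. The helper names are hypothetical.

```python
def entailed_closure(unit, entails):
    """All iUnits entailed by `unit`, following entailment transitively."""
    seen = set()
    stack = list(entails.get(unit, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(entails.get(v, ()))
    return seen


def revise_weights(weights, entails):
    """Apply w(u) - max_{u' in e(u)} w(u') to every iUnit (Eq. 11.1)."""
    revised = {}
    for u, w in weights.items():
        closure = entailed_closure(u, entails)
        # Assumption: with no entailed iUnits, nothing is subtracted.
        revised[u] = w - max((weights[v] for v in closure), default=0)
    return revised


# The four-iUnit Ichiro example: (4) entails (3), and (3) entails (1) and (2).
weights = {1: 3, 2: 3, 3: 7, 4: 8}
entails = {3: [1, 2], 4: [3]}
revised = revise_weights(weights, entails)
```

For the example in the text this yields the revised weights 3, 3, 4, and 1 for iUnits (1) to (4), respectively.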
11.3.3 S-Measure 


S-measure (Sakai et al. 2011a) was the primary evaluation metric at NTCIR-9 
1CLICK-1 and NTCIR-10 1CLICK-2. Letting M be a set of iUnits identified in 
a summary, S-measure is defined as 


$$\text{S-measure} = \frac{1}{N} \sum_{u \in M} w(u) \max\bigl(0,\, 1 - \mathrm{offset}(u)/L\bigr), \tag{11.2}$$


where N is a normalization factor, w(u) is the weight of an iUnit u, L is a patience 
parameter, and offset(u) is the offset of an iUnit u in the summary (more precisely, 
it is the number of characters between the beginning of the summary and the end 
of the iUnit). This measure basically represents the sum of the weights w(u) with 
offset-based decay (1 − offset(u)/L) for iUnits in a summary. Figure 11.2 illustrates 
S-measure computation with a simple example. As shown in the figure, the decay 
is assumed to decrease linearly with respect to the offset of an iUnit, and totally 
cancels the value of an iUnit appearing after L characters (the maximum function 
simply prevents the decay from being negative). Thus, the patience parameter can 
be interpreted as how many characters can be read by the user, or, alternatively, how 
much time the user can spend to read the summary when it is divided by the reading 
speed. For example, L = 500 in Fig. 11.2. If the reading speed is 500 characters per 
minute for average Japanese users, this patience parameter indicates that the user 
spends only a minute and leaves right after a minute passes. This corresponds to the 
fact that the decay factor becomes zero (or no value) after 500 characters. 

The normalization factor N sets the upper bound so that S-measure ranges from 0 to 1, 
and is defined as 


$$N = \sum_{u \in U} w(u) \max\bigl(0,\, 1 - \mathrm{offset}^{*}(v(u))/L\bigr), \tag{11.3}$$


where U is a set of all iUnits and offset* (v(u)) is the offset of the vital string of an 
iUnit u in Pseudo Minimal Output (PMO), which is an ideal summary artificially 
created for estimating the upper bound. The PMO was obtained by sorting all vital 
strings by w(u) (first key) and |v(u)| (second key) and concatenating them. Note that 
this procedure of generating an ideal summary may not be optimal, yet it is not a 
serious problem in practice as discussed in the original paper (Sakai et al. 2011a). 

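Finally, the original notation of S-measure is shown below; it is equivalent to 
Eq. 11.2 because the factor 1/L cancels between the numerator and the denominator: 


$$\text{S-measure} = \frac{\sum_{u \in M} w(u) \max\bigl(0,\, L - \mathrm{offset}(u)\bigr)}{\sum_{u \in U} w(u) \max\bigl(0,\, L - \mathrm{offset}^{*}(v(u))\bigr)}. \tag{11.4}$$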
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# of characters read 
(Reading time) 


Output Lg | | L | 


Fig. 11.2 Illustration of S-measure computation. The x-axis represents the number of characters
read by the user, and the y-axis represents the offset-based decay max(0, 1 - offset(u)/L) with L =
500. The x-axis can also be interpreted as reading time, indicated in parentheses, when the reading
speed is 500 characters per minute. The textual output located at the bottom includes three iUnits
u1, u2, and u3. The positions of the iUnits are aligned to the x-axis and their offsets are 125, 250, and 500,
respectively. Their weight is 1 for simplicity. S-measure for this textual output can be computed as
S-measure = $\frac{1}{\mathcal{N}}(1 \cdot 0.75 + 1 \cdot 0.50 + 1 \cdot 0.00) = \frac{1}{\mathcal{N}} \cdot 1.25$


11.3.4 M-Measure 


M-measure (Kato et al. 2016a), which was proposed for two-layered summaries, was the
primary evaluation metric at NTCIR-11 MobileClick-1 and NTCIR-12 MobileClick-2.

Intuitively, a two-layered summary is good if: (1) The summary does not include 
non-relevant iUnits in the first layer; (2) The first layer includes iUnits relevant for 
all the intents; and (3) iUnits in the second layer are relevant for the intent that links 
to them. 

To be more specific, the following choices and assumptions were made for eval- 
uating two-layered summaries: 


• Users are interested in one of the intents $i \in I_q$, following the intent probability
  P(i|q), where $I_q$ is a set of intents for query q.
• Each user reads a summary following these rules:

  1. The user starts to read a summary from the beginning of the first layer.

  2. When reaching the end of a link $l_i$ which interests a user with intent i, the user
     clicks on the link and starts to read its second layer $s_i$.

  3. When reaching the end of the second layer $s_i$, the user goes back to the end of
     the link $l_i$ and continues reading.

  4. The user stops after reading no more than L characters.

• The weight of iUnits is judged per intent. Therefore, an iUnit that is important for
  one user may not be important for another user.

• The utility of text read by a user is measured by U-measure proposed by Sakai and
  Dou (2013), which consists of a position-based gain and a position-based decay
  function.

• The evaluation metric for two-layered summaries, M-measure, is the expected
  utility of text read by different users.


These choices and assumptions determine all possible trailtexts and their probabilities
in a two-layered summary. A trailtext is a concatenation of all the texts read
by a user, and can be defined as a list of iUnits and links consumed by the user.
According to the user model described above, a trailtext of a user who is interested
in intent i can be obtained by inserting a list of iUnits in the second layer $s_i$ after
the link $l_i$. More specifically, given the first layer $f = (u_1, \ldots, u_{j-1}, l_i, u_j, \ldots)$
and second layer $s_i = (u_{i,1}, \ldots, u_{i,|s_i|})$, trailtext $t_i$ of intent i is defined as follows:
$t_i = (u_1, \ldots, u_{j-1}, l_i, u_{i,1}, \ldots, u_{i,|s_i|}, u_j, \ldots)$. An example of trailtexts in a
two-layered summary is shown in Fig. 11.3.
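The construction of a trailtext can be sketched in a few lines of Python. This is only an illustration of the definition above: the function name, the use of plain string identifiers for iUnits and links, and the example data are hypothetical, and the patience cutoff is left out because it is handled by the decay function.

```python
from typing import Dict, List


def build_trailtext(first_layer: List[str],
                    second_layers: Dict[str, List[str]],
                    clicked_link: str) -> List[str]:
    """Insert the second layer of `clicked_link` right after that link."""
    trailtext: List[str] = []
    for item in first_layer:
        trailtext.append(item)
        if item == clicked_link:
            # The user reads the whole second layer, then returns to the first layer.
            trailtext.extend(second_layers.get(item, []))
    return trailtext


# Hypothetical two-layered summary with two links, l1 and l2.
first_layer = ["u1", "u2", "l1", "u3", "l2", "u4"]
second_layers = {"l1": ["u1,1", "u1,2"], "l2": ["u2,1"]}

print(build_trailtext(first_layer, second_layers, "l1"))
print(build_trailtext(first_layer, second_layers, "l2"))
```

Each intent yields its own trailtext because only the link that interests the user is expanded; the other link is read as ordinary first-layer text.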
M-measure, an evaluation metric for the two-layered summarization, is the 
expected utility of text read by users: 


\[
M = \sum_{t \in T} P(t)\, U(t), \tag{11.5}
\]

where T is a set of all possible trailtexts, P(t) is the probability of going through a
trailtext t, and U(t) is the U-measure score of a trailtext t.

Fig. 11.3 Example of trailtexts in a two-layered summary. Suppose links $l_1$ and $l_2$ are interesting for
users with intents 1 and 2, respectively. All the users start to read the summary from the beginning
of the first layer and read iUnits $u_1$ and $u_2$. A user with intent 1 clicks on link $l_1$, reads the iUnits in
the second layer $s_1$, and goes back to the first layer to read the rest. A user with intent 2 does
not click on link $l_1$ but clicks on link $l_2$, reads the iUnits in $s_2$, and returns to the first layer. These
different trails result in the different trailtexts shown at the bottom of the figure

For simplicity, a one-to-one relationship between links and intents was assumed
in NTCIR-12 MobileClick-2. Therefore, there is exactly one relevant link and one trailtext
for each intent. It follows that the probability of each trailtext being generated is
equivalent to the probability of the corresponding intent, i.e., $P(t_i) = P(i|q)$, where
$t_i$ denotes the trailtext read by users with intent i. Then, M-measure can be rewritten
as

\[
M = \sum_{i \in I_q} P(i|q)\, U_i(t_i), \tag{11.6}
\]

where $U_i(t_i)$ is the U-measure score of a trailtext $t_i$ for users with intent i.

The computation of U-measure (Sakai and Dou 2013) is the same as that of S-measure
except for the normalization factor and the definition of the weight. U-measure
is defined as follows:

\[
U_i(t) = \frac{1}{\mathcal{N}} \sum_{j=1}^{|t|} g_i(u_j)\, d(u_j), \tag{11.7}
\]

where $g_i(u_j)$ is the weight of iUnit $u_j$ in terms of intent i, d is a position-based decay
function, and $\mathcal{N}$ is a constant normalization factor ($\mathcal{N} = 1$ in NTCIR MobileClick).
Note that a link in the trailtext is regarded as a non-relevant iUnit for the sake of
convenience. The position-based decay function is the same as that of S-measure:

\[
d(u) = \max(0, 1 - \mathrm{offset}(u)/L). \tag{11.8}
\]
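Equations 11.6 to 11.8 can be combined into a short Python sketch. The data layout (each trailtext item as a pair of character length and gain), the function names, and the example numbers are assumptions made for illustration; they are not the evaluation tool used at NTCIR. Links are given a gain of 0, following the convention stated above.

```python
from typing import Dict, List, Tuple

# One trailtext item: (number of characters, gain for the reading user's intent).
# Links are treated as non-relevant iUnits, i.e. their gain is 0.
Item = Tuple[int, float]


def u_measure(trailtext: List[Item], patience: int, norm: float = 1.0) -> float:
    """U-measure of one trailtext (Eqs. 11.7 and 11.8)."""
    score, offset = 0.0, 0
    for length, gain in trailtext:
        offset += length  # characters read up to the end of this item
        score += gain * max(0.0, 1.0 - offset / patience)
    return score / norm


def m_measure(trailtexts: Dict[str, List[Item]],
              intent_prob: Dict[str, float],
              patience: int) -> float:
    """Expected U-measure over intents (Eq. 11.6)."""
    return sum(intent_prob[intent] * u_measure(trail, patience)
               for intent, trail in trailtexts.items())


# Hypothetical example with two intents and L = 100.
trailtexts = {
    "intent_a": [(20, 1.0), (10, 0.0), (30, 1.0)],  # iUnit, link, second-layer iUnit
    "intent_b": [(20, 0.0), (10, 0.0), (50, 1.0)],
}
intent_prob = {"intent_a": 0.75, "intent_b": 0.25}
print(m_measure(trailtexts, intent_prob, patience=100))
```

In this example the same first-layer iUnit has gain 1 for one intent and 0 for the other, which reflects the per-intent weighting of iUnits.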


11.4 Outcomes of NTCIR 1CLICK and MobileClick 


This section highlights the outcomes of NTCIR 1CLICK and MobileClick. We first 
present the results of each task and then discuss their potential impacts. 


11.4.1 Results 


Table 11.4 shows the number of participants and submissions at each NTCIR task. 
While the first round of 1CLICK and MobileClick failed to attract many participants, 
the second round of each received a sufficient number of submissions from ten or 
more teams. Because the first rounds attracted only a few participants, we only summarize
results from 1CLICK-2 and MobileClick-2.

Table 11.4 The number of participants and submissions at each NTCIR task

Year  NTCIR  Task           # of participants  # of submissions
2010  9      1CLICK-1       3                  10
2011  10     1CLICK-2       10                 38 (for the Main task)
2013  11     MobileClick-1  4                  24 (for retrieval) & 11 (for summarization)
2014  12     MobileClick-2  12                 37 (for retrieval) & 29 (for summarization)


The NTCIR-10 1CLICK-2 results showed that simple use of search engine snippets
and the first paragraph of Wikipedia articles outperformed more sophisticated
approaches for both the English and Japanese queries. Those simple approaches
were particularly effective for celebrity query types, but not for other
types such as local queries (Kato et al. 2013b).

The NTCIR-12 MobileClick-2 task results showed that some participants’ runs
outperformed the baselines. Since the MobileClick task required systems to group
iUnits relevant to the same intent, some teams proposed effective methods to measure
the similarity between intents and iUnits, and achieved significantly better results than
the baselines. For example, one of the top performers used word embeddings for measuring
the intent-iUnit similarity, and another team proposed an extension of topic-sensitive
PageRank for the summarization task. Per-query analysis at MobileClick-2
also suggested that celebrity query types were easy, while local and Q&A types of
queries were difficult for both the baselines and participants’ systems (Kato et al. 2016b).


11.4.2 Impacts 


An evaluation metric for summaries, ranked lists, and sessions, U-measure, was 
proposed by Sakai and Dou (2013). As they explained, U-measure was inspired by 
S-measure and is a generalization of S-measure. U-measure was further extended to 
the evaluation of customer-helpdesk dialogues by Zeng et al. (2017). 

Luo et al. (2017) proposed height-biased gain (HBG), an evaluation metric for 
mobile search engine result pages. HBG is computed by summing up the product of 
weight and decay that are both modeled in terms of result height in mobile search 
engine result pages. As the authors mentioned in their paper, U-measure is one of 
the evaluation metrics that inspired HBG. 

Arora and Jones (2017a,b) adapted the definition of iUnits for their study on 
identifying useful and important information and how people perceive information. 

In commercial search engines, direct answers or featured snippets have become 
an important part of the search engine result page. This functionality presents a 
text that answers a question given as a query, just like the textual output of NTCIR 
1CLICK. As of May 2019, it seems that they only show a part of a webpage and do
not summarize multiple webpages. The evaluation methodology of NTCIR 1CLICK
and MobileClick could be useful when direct answers are composed from
multiple webpages and need to be evaluated in detail.


11.5 Summary 


This chapter introduced the earliest attempts toward test-collection-based evaluation 
for information access with smartphones, namely, NTCIR 1CLICK and MobileClick. 
Those campaigns aimed to develop an IR system that outputs a single, short text sum- 
mary for a given query, which is expected to fit a small screen and to satisfy users’ 
information needs without requiring much interaction. This chapter mainly discussed 
the novelty of the evaluation methodology used in those evaluation campaigns by contrasting
it with ordinary summarization evaluation. The potential impacts
of NTCIR 1CLICK and MobileClick were also discussed.


Chapter 12
Mathematical Information Retrieval


Akiko Aizawa and Michael Kohlhase 


Abstract We present an overview of the NTCIR Math Tasks organized during 
NTCIR-10, 11, and 12. These tasks are primarily dedicated to techniques for search- 
ing mathematical content with formula expressions. In this chapter, we first sum- 
marize the task design and introduce test collections generated in the tasks. We 
also describe the features and main challenges of mathematical information retrieval 
systems and discuss future perspectives in the field. 


12.1 Introduction 


The NTCIR Math Tasks are aimed at developing test collections for mathemati- 
cal search in STEM (Science/Technology/Engineering/Mathematics) documents to 
facilitate and encourage research in mathematical information retrieval (MIR) (Liska 
et al. 2011) and its related fields (Guidi and Sacerdoti Coen 2016; Zanibbi and 
Blostein 2012). 

Mathematical formulae are important for the dissemination and communication 
of scientific information. They are not only used for numerical calculation but also for 
clarifying definitions or disambiguating explanations that are written in natural lan- 
guage. Despite the importance of math in technical documents, most contemporary 
information retrieval systems do not support users’ access to mathematical formulae 
in target documents. One major obstacle to MIR research is the lack of readily avail- 
able large-scale datasets with structured mathematical formulae, carefully designed 
tasks, and established evaluation methods. 

MIR involves searching for a particular mathematical concept, object, or result,
often expressed using mathematical formulae, which (in their machine-readable
forms) are expressed as complex expression trees. To answer MIR queries, a search
system should tackle at least two challenges: (1) tree structure search and (2) utilization
of textual context information.

To understand the problem, consider an engineer who wants to prevent an electrical
system from overheating and thus needs a tight upper estimate for the energy term

\[
\int_a^b |V(t) I(t)| \, dt
\]

for all a, b, where V is the voltage and I the current. Search engines, such as Google, are
restricted to word-based searches of mathematical articles, which barely helps with
finding mathematical objects because there are no keywords to search for. Computer
algebra systems cannot help either since they do not incorporate the necessary special
knowledge. However, the required information is out there, e.g., in the form of


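Theorem 17. (Hölder’s Inequality)
If f and g are measurable real functions, $l, h \in \mathbb{R}$, and $p, q \in [0, \infty)$, such that 1/p + 1/q = 1, then

\[
\int_l^h |f(x) g(x)| \, dx \le \left( \int_l^h |f(x)|^p \, dx \right)^{1/p} \left( \int_l^h |g(x)|^q \, dx \right)^{1/q}
\]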


For mathematical content (here the statement of Hölder’s inequality) to be truly
searchable, it must be in a form in which an MIR system can find it from a query

\[
\int_a^b |V(t) I(t)| \, dt \le \boxed{R}
\]

(the boxed identifiers are query variables; see Sect. 12.3.2) and can even extend the
calculation to

\[
\int_a^b |V(t) I(t)| \, dt \le \left( \int_a^b |V(x)|^2 \, dx \right)^{1/2} \left( \int_a^b |I(x)|^2 \, dx \right)^{1/2}
\]

after the engineer chooses p = q = 2 (Cauchy–Schwarz inequality). Estimating the
individual V and I values is now a much simpler problem.

Admittedly, Google would have found the information by querying for “Cauchy–Schwarz
Hölder”, but that keyword was the crucial information the engineer was
missing in the first place. In fact, it is not unusual for mathematical document collections
to be so large that determining the identifier of the sought-after object is harder
than recreating the actual object.

In this example we see the effect of both (1) formula structure search and (2) 
context information as postulated above: 


1. The formula structure is mapped by unification (finding a substitution for the
   boxed query variables to make the query and main formula of Hölder’s inequality
   structurally identical or similar; see Sect. 12.3.2).
2. We have used the context information about the parameters of Hölder’s inequality,
   e.g., that the identifiers f, g, p, and q are universal (thus can be substituted for);
   the first two are measurable functions and the last two are real numbers.


In the following sections, we summarize our attempts at NTCIR to develop 
datasets for MIR together with some future perspectives of the field. 


12.2 NTCIR Math: Overview 


Prior to the NTCIR Math Tasks, MIR had been mainly approached by researchers in
digital mathematics libraries, and only a little attention had been paid by the information
retrieval community. Unlike other scientific disciplines that require a search for
specific types of named entities such as genes, diseases, and chemical compounds, 
mathematics is based on abstract concepts with many possible interpretations when 
mapped to a real-world phenomenon. This means that although their mathematical 
definitions are rigid, mathematical concepts are inherently ambiguous in their appli- 
cations to the real world. Also, the representation of mathematical formulae can be 
highly complicated with diverse types of symbols including user-defined functions, 
constants, and free and bound variables. As such, MIR requires dedicated search 
techniques such as approximate tree matching or unification. To summarize, in the
context of information retrieval, MIR not only poses the challenge of a novel retrieval target
but also serves as a testbed for (1) retrieval of non-textual objects in documents
using their context information and (2) large-scale complex tree structure search
with a realistic application scenario.

The NTCIR Math Tasks were the first attempt to apply an information retrieval
evaluation framework to mathematical formula search. They were
organized three times, during NTCIR-10, 11, and 12, i.e., as the NTCIR-10 Math Pilot
Task, NTCIR-11 Math-2 Task, and NTCIR-12 MathIR Task.


12.2.1 NTCIR-10 Math Pilot Task 


The NTCIR-10 Math Pilot Task (Aizawa et al. 2013) was the first attempt to develop 
a common workbench for mathematical formula search. This task was organized as 
two independent subtasks: 


1. The first was the Math Retrieval Subtask in which the objective was to retrieve 
relevant documents given a math query. 

2. The second was the Math Understanding Subtask in which the objective was to 
identify textual spans that describe math formulae that appear in the document.

The corpus used for this task was based on 100,000 arXiv documents converted from
LaTeX to XHTML by the arXMLiv project.¹

Six teams participated in this task, all six contributing to the Math Retrieval 
Subtask and only one to the Math Understanding Subtask. 


12.2.2 NTCIR-11 Math-2 Task 


The NTCIR-10 Math Pilot Task showed that participants considered the Math
Retrieval Subtask more important. Therefore, the succeeding two tasks focused only
on this subtask and made it compulsory for all participants. In the NTCIR-11
Math-2 Task (Aizawa et al. 2014), based on the feedback from the participants in the
pilot task, both the arXiv corpus and topics were reconstructed. Apart from this main
subtask using the arXiv corpus, the NTCIR-11 Math-2 Task also provided an open
free subtask using math-related Wikipedia articles. This optional subtask required
an exact formula search (without any keywords) and complemented the main subtask
with an automated performance evaluation.

Eight teams participated in the NTCIR-11 Math-2 Task (two new teams joined),
most contributing to both subtasks.


12.2.3 NTCIR-12 MathIR Task 


For the NTCIR-12 MathIR Task (Zanibbi et al. 2016), we reused the arXiv corpus we
prepared for the NTCIR-11 Math-2 Task but with new topics. The arXiv subtask introduced
a new formula query operator, the simto region, that explicitly requires an approximate
matching function for math formulae. We also created a new corpus of Wikipedia
articles to provide a use case of math retrieval by nonexperts. The design of the
subtask for the Wikipedia corpus was similar to that in the NTCIR-11 Math-2 Task
except that the topics include not only exact formula search but also formula+keyword
search (Table 12.1).

Six teams participated in the NTCIR-12 MathIR Task.


12.3 NTCIR Math Datasets 


In this section, we mainly describe the two datasets, arXiv and Wikipedia, designed
for the Math Retrieval Subtasks during NTCIR-12. Each dataset consists of a corpus
with mathematical formulae, a set of topics in which each query is expressed as
a combination of mathematical formulae schemata and keywords, and relevance
judgment results based on the submissions from participating teams.


¹https://kwarc.info/projects/arXMLiv/.

Table 12.1 Summary of NTCIR math subtasks (o: provided, -: not provided)

Subtask                                     Query type                            NTCIR-10  NTCIR-11  NTCIR-12
Math Retrieval Subtask (arXiv corpus)       Formula search                        o         o         o
                                            Formula+keyword search                o         o         o
                                            Formula+keyword search with "simto"   -         -         o
                                            Free-form query search                o         -         -
Math Retrieval Subtask (Wikipedia corpus)   Formula search                        -         o         o
                                            Formula+keyword search                -         -         o
                                            Formula+keyword search with "simto"   -         -         -
Math Understanding Subtask                                                        o         -         -


12.3.1 Corpora 


The arXiv corpus contains paragraphs from technical articles in the arXiv,² while the
Wikipedia corpus contains complete articles from Wikipedia. Generally speaking, 
the arXiv articles (preprints of research articles) were written by technical experts for 
technical experts assuming a high level of mathematical sophistication from readers. 
In contrast, many Wikipedia articles on mathematics were written to be accessible 
for nonexperts at least in part. 


12.3.1.1 ArXiv Corpus 


The arXiv corpus consists of 105,120 scientific articles in English. These articles were
converted from LaTeX sources available at http://arxiv.org to HTML5+MathML using
the LaTeXML system³ and include the arXiv categories math, cs, physics:math-ph,
stat, physics:hep-th, and physics:nlin to obtain a varied sample of technical
documents containing mathematics.


²http://www.arxiv.org.
³http://dlmf.nist.gov/LaTeXML/.

Fig. 12.1 Math formulae statistics for the arXiv corpus


The arXiv subtask was designed for both formula-based search systems and document-based
retrieval systems. In document-wise evaluation, human evaluators need to
check all math formulae in the document. To reduce the cost of relevance judgment,
we divided each document into paragraphs and used them as the search units (“documents”)
for the subtask. This produced 8,301,578 search units with roughly 60
million math formulae (including isolated symbols) encoded using LaTeX, Presentation
MathML, and Content MathML⁴; 95% of the retrieval units had 23
or fewer math formulae, which is sufficiently small for document-based relevance
judgment by human reviewers. Excerpts are stored independently in separate files,
in both HTML5 and XHTML5 formats.

⁴MathML (Ausbrooks et al. 2010) supplies two sub-languages: Presentation MathML encodes the
visual (and possibly aural) appearance of the formulae in terms of a tree of layout primitives and
Content MathML encodes the functional structure of formulae in terms of an operator tree.

Figure 12.1 summarizes the basic statistics for the math formula trees in the arXiv
corpus. Figure 12.1a–d correspond to the distributions of the total number of nodes,
maximum tree depth, average number of child nodes, and total number of leaf nodes
in each math formula, respectively. These statistics show that the math trees in the
arXiv corpus approximately follow a power-law distribution in their size. While
there exists a vast number of relatively simple trees, there also exists a non-negligible
number of highly complex trees. This clearly shows that, as a benchmark for tree
structure search, the corpus is characterized by its large scale as well as the heterogeneity
of the trees in it.
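The four per-formula statistics named above can be computed with a single traversal of a formula tree. The sketch below is illustrative only: the nested-tuple tree representation and the function name are assumptions, and "average number of child nodes" is interpreted here as the mean over internal (non-leaf) nodes, which may differ from the definition used for Fig. 12.1.

```python
from typing import List, Tuple, Union

# A formula tree is either a leaf label (str) or a pair (label, [children]).
Tree = Union[str, Tuple[str, List["Tree"]]]


def tree_stats(tree: Tree) -> dict:
    """Return node count, max depth, leaf count, and mean children per internal node."""
    nodes = leaves = internal = child_links = 0
    max_depth = 0
    stack = [(tree, 1)]  # iterative traversal avoids recursion limits on deep trees
    while stack:
        node, depth = stack.pop()
        nodes += 1
        max_depth = max(max_depth, depth)
        if isinstance(node, str) or not node[1]:
            leaves += 1
            continue
        internal += 1
        child_links += len(node[1])
        stack.extend((child, depth + 1) for child in node[1])
    avg_children = child_links / internal if internal else 0.0
    return {"nodes": nodes, "max_depth": max_depth,
            "leaves": leaves, "avg_children": avg_children}


# Hypothetical operator tree for x^2 + y.
formula = ("plus", [("power", ["x", "2"]), "y"])
print(tree_stats(formula))
```

Collecting these dictionaries over a corpus and plotting their histograms on log-log axes is one simple way to inspect whether the sizes look power-law distributed.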


12.3.1.2 Wikipedia Corpus 


The Wikipedia corpus contains 319,689 articles from English Wikipedia converted
into a simpler XHTML format with images removed (5.15 GB uncompressed).⁵
Unlike the arXiv corpus, articles were not split into smaller documents since they
were simple/small enough for human annotation. Only 10% of the articles of the
Wikipedia corpus contain explicit <math> tags that demarcate LaTeX, reflecting
the small proportion of articles related to math in Wikipedia, while keeping the corpus
size manageable for participants. All articles with a <math> tag were included in
the corpus and the remaining 90% were sampled from articles that do not contain
any <math> tag. These “text” articles act as distractors for keyword matching.
There are over 590,000 formulae in the corpus with the same format as the arXiv
corpus, i.e., encoded using LaTeX, Presentation MathML, and Content MathML. Note
that untagged formulae frequently appear directly in HTML text (e.g., ‘where x
<sup>2 ...’). We made no attempt to detect or label these formulae embedded
in the main text.


12.3.2 Topics 


The Math Retrieval Subtasks were designed so that all topics include at least a single
relevant document in the corpus, and ideally multiple relevant documents. In some
cases, multiple relevant documents are not possible, for example, with navigational
queries where a specific document is sought after.


12.3.2.1 Topic Format 


Details about the topic format are available in the documentation provided by the
organizers (Kohlhase 2015). For participants, a math retrieval topic contains (1) a
topic ID and (2) a query (formula + keywords), but no textual description. The description
is omitted to avoid participants biasing their system design toward the specific
information needs identified in the topics. For evaluators, each topic also contains
a narrative field that describes a user situation, the user’s information needs, and
relevance criteria. Formula queries are encoded in LaTeX, Presentation MathML,
and Content MathML. In addition to the standard MathML notations, the following
two subtask-specific extensions are adopted: formulae query variables and formula
simto regions (see below).


⁵http://www.cs.rit.edu/~rlaz/NTCIR12_MathIR_WikiCorpus_v2.1.0.tar.bz2.

Formulae Query Variables (Wildcards). Formulae may contain query variables
that act as wildcards, which can be matched to arbitrary subexpressions on candidate
formulae. Query variables were represented differently in
the arXiv and Wikipedia topics. For the arXiv topics, query variables are named and
indicated by a question mark (e.g., ?v), while for the Wikipedia topics, wildcards are
numbered and appear between asterisks (e.g., *1*).

This is an example query formula with the three query variables ?f, ?v, and ?d.

\[
\frac{?f(?v + ?d) - ?f(?v)}{?d} \tag{12.1}
\]

This query matches the argument of the limit on the right side of the equation below,
substituting g for ?f, cx for ?v, and h for ?d. Note that each repetition of a query
variable matches the same subexpression.

\[
g'(cx) = \lim_{h \to 0} \frac{g(cx + h) - g(cx)}{h} \tag{12.2}
\]
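The matching of Eq. 12.1 against the limit argument in Eq. 12.2 can be reproduced with a small one-way matcher. This is a minimal sketch, not the matcher of any participating system: the nested-tuple encoding of expressions, the operator names ("frac", "minus", "apply", ...), and the function name are assumptions, and only the query side may contain variables, which makes this a special case of unification.

```python
from typing import Dict, Optional, Tuple, Union

# An expression is a symbol (str) or a tuple (operator, arg1, arg2, ...).
Expr = Union[str, Tuple]


def match(query: Expr, candidate: Expr,
          bindings: Optional[Dict[str, Expr]] = None) -> Optional[Dict[str, Expr]]:
    """Return wildcard bindings if `query` matches `candidate`, else None."""
    bindings = dict(bindings or {})
    if isinstance(query, str) and query.startswith("?"):
        if query in bindings:  # repeated variable: must match the same subtree
            return bindings if bindings[query] == candidate else None
        bindings[query] = candidate
        return bindings
    if isinstance(query, str) or isinstance(candidate, str):
        return bindings if query == candidate else None
    if len(query) != len(candidate):
        return None
    for q_part, c_part in zip(query, candidate):
        result = match(q_part, c_part, bindings)
        if result is None:
            return None
        bindings = result
    return bindings


cx = ("times", "c", "x")
query = ("frac",
         ("minus", ("apply", "?f", ("plus", "?v", "?d")), ("apply", "?f", "?v")),
         "?d")
candidate = ("frac",
             ("minus", ("apply", "g", ("plus", cx, "h")), ("apply", "g", cx)),
             "h")
print(match(query, candidate))
```

The check on already-bound variables is what enforces the rule that each repetition of a query variable matches the same subexpression.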


Formula Simto Regions. Similarity regions modify our formula query language,
distinguishing subexpressions that should be identical to the query from those that
are similar to the query in some sense. Consider the query formula below, which
contains a similarity region called “a”.

\[
\frac{\boxed{g(cx + h) - g(cx)}_{a}}{h} \tag{12.3}
\]

The fraction operator and denominator h should match exactly, while the numerator
may be replaced by a “similar” subexpression. Depending on the notion of similarity
we choose to adopt, simto region “a” might match “g(cx + h) + g(cx)”, if addition
is similar to subtraction, or “g(cx + h) - g(dx)”, if c is somehow similar to d. The
simto regions may also contain exact match constraints (see Kohlhase 2015).
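To make the distinction between exact and similar matching concrete, the following Python sketch matches everything outside a simto region exactly and compares the region itself with a similarity score. The choice of Jaccard similarity over symbol sets, the threshold of 0.8, the ("simto", name, subexpression) encoding, and all function names are hypothetical; the task definition deliberately leaves the notion of similarity open.

```python
from typing import Set, Tuple, Union

# An expression is a symbol (str) or a tuple (operator, arg1, arg2, ...).
Expr = Union[str, Tuple]


def symbols(expr: Expr) -> Set[str]:
    """Collect every operator and symbol label in an expression tree."""
    if isinstance(expr, str):
        return {expr}
    labels: Set[str] = set()
    for part in expr:
        labels |= symbols(part)
    return labels


def jaccard(a: Expr, b: Expr) -> float:
    """Jaccard similarity of the label sets (one possible similarity notion)."""
    sa, sb = symbols(a), symbols(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def simto_match(query: Expr, candidate: Expr, threshold: float = 0.8) -> bool:
    """Exact match outside ("simto", name, subexpr) regions, similarity inside."""
    if isinstance(query, tuple) and query and query[0] == "simto":
        return jaccard(query[2], candidate) >= threshold
    if isinstance(query, str) or isinstance(candidate, str):
        return query == candidate
    return (len(query) == len(candidate)
            and all(simto_match(q, c, threshold) for q, c in zip(query, candidate)))


cx = ("times", "c", "x")
g_shifted = ("apply", "g", ("plus", cx, "h"))
query = ("frac", ("simto", "a", ("minus", g_shifted, ("apply", "g", cx))), "h")
with_plus = ("frac", ("plus", g_shifted, ("apply", "g", cx)), "h")
other_denominator = ("frac", ("minus", g_shifted, ("apply", "g", cx)), "k")

print(simto_match(query, with_plus))
print(simto_match(query, other_denominator))
```

Under this particular similarity notion the numerator with addition is accepted, while a different denominator is rejected no matter how similar the numerator is, because the denominator lies outside the region.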


12.3.2.2 ArXiv Topics 


A total of 50 and 37 topics were provided during NTCIR-11 and NTCIR-12, respectively.
Many of the topics in the arXiv subtask are sophisticated, for example, seeking
to determine whether a connection exists between a factorial product and products
starting with one. Some queries are simpler, such as looking for applications of operators,
or loss functions used in machine learning. Eight out of the 37 topics during
NTCIR-12 contained simto regions.


12.3.2.3 Wikipedia Topics


Topics for the Wikipedia subtask were designed with a less expert user population 
in mind. We imagined undergraduate and graduate students searching Wikipedia to 
locate or remember and relocate specific articles (i.e. navigational queries), browse 
math articles, learn/review mathematical concepts and notation they come across in 
their studies, find applications of concepts, or find information to help solve particular 
mathematical problems (e.g., for homework). A total of 30 topics were provided 
during NTCIR-12. 


12.3.3 Relevance Judgment 


The evaluation of the Math Retrieval Subtasks was pooling-based. First, all submitted
results were converted into a trec_eval result file format. Next, for each topic, the
top-20 ranked documents were selected from each run. Then, the set of pooled hits
was evaluated by human assessors. After the pooling process, the selected retrieval
units were fed into the SEPIA system⁶ with MathML extensions developed by the
organizers. Evaluators judged the relevance of each retrieval unit by comparing it
to the query formulae and keywords, along with the described scenario provided
with the topic, and selected one of the judgments relevant (R), partially
relevant (PR), or not-relevant (N). The retrieval units were documents
except for the Wikipedia formula-only subtask, where the evaluation was based on individual
formulae.
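The pooling step described above amounts to taking, per topic, the union of the top-ranked results of all runs. A minimal sketch follows; the in-memory run format (a dictionary from topic ID to a ranked list of retrieval unit IDs) and the function name are assumptions for illustration, whereas the actual submissions were trec_eval result files.

```python
from collections import defaultdict
from typing import Dict, List, Set

# A run maps each topic ID to its ranked list of retrieval unit IDs.
Run = Dict[str, List[str]]


def build_pool(runs: List[Run], depth: int = 20) -> Dict[str, Set[str]]:
    """Union of the top-`depth` results of every run, per topic."""
    pool: Dict[str, Set[str]] = defaultdict(set)
    for run in runs:
        for topic, ranking in run.items():
            pool[topic].update(ranking[:depth])
    return dict(pool)


# Two hypothetical runs, pooled to depth 2 to keep the example small.
runs = [
    {"topic-1": ["a", "b", "c"]},
    {"topic-1": ["c", "a"], "topic-2": ["x"]},
]
print(build_pool(runs, depth=2))
```

Because the pool is a set, a retrieval unit returned by several runs is judged only once.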

Evaluators had to rely on their mathematical intuition, the described information
needs, and the actual query to determine judgments. For the arXiv dataset, to ensure sufficient
familiarity with mathematical documents, three evaluators were chosen from
third-year and graduate students of (pure) mathematics. Each topic was evaluated by
at least two evaluators. For the Wikipedia dataset, intended to represent mathematical
information needs for nonexperts, ten students were recruited for evaluation: five
undergraduates and five graduate (MSc) students. The Fleiss’ κ values were 0.5615
and 0.5380 for the arXiv dataset and 0.3546 and 0.2619 for the Wikipedia dataset.
Agreement between evaluators for the arXiv dataset was higher. This may be because
of the greater mathematical expertise and shared background of these evaluators.
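For readers unfamiliar with Fleiss' κ, the standard formula can be sketched as follows. The input layout (one row per judged item, one column per category R, PR, N, holding the number of assessors who chose that category) and the example counts are hypothetical and are not the NTCIR judgment data; the sketch also assumes that every item received the same number of judgments.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for an items x categories table of rating counts.

    Every row must sum to the same number of raters n (n >= 2).
    Returns NaN when chance agreement is 1 (all ratings in one category).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Observed agreement: mean over items of the pairwise agreement rate.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical counts for four items judged by two assessors each (R, PR, N).
example = [[2, 0, 0], [1, 1, 0], [0, 0, 2], [0, 2, 0]]
print(fleiss_kappa(example))
```

Values near 1 indicate almost perfect agreement and values near 0 indicate agreement no better than chance, which is the scale on which the figures reported above should be read.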


⁶https://code.google.com/p/sepia/.


12.4 Task Results and Discussion


12.4.1 Evaluation Metrics 


In our evaluation, the judgment of each evaluator was converted into a relevance score
using the mappings “Relevant” → 2, “Partially Relevant” → 1, and “Not Relevant”
→ 0. Then, the average score was binarized as follows:

• For “relevance” evaluation, the overall judgment is considered relevant if the
  average score is greater than or equal to 1.5, and not relevant otherwise.

• For “partial relevance” evaluation, the overall judgment is considered relevant
  if the average score is greater than or equal to 0.5, and not relevant otherwise.

Precision@k for k ∈ {5, 10, 15, 20} was used to evaluate participating systems. We
chose these measures because they are simple to understand and characterize retrieval
behavior as the number of hits increases. Precision@k values were obtained from
trec_eval version 9.0, in which they are labeled P_avgjg_5,
P_avgjg_10, P_avgjg_15, and P_avgjg_20, respectively.
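The score mapping, the two binarization thresholds, and Precision@k can be written down in a few lines. The sketch below is a plain Python illustration with hypothetical names and data; the official figures were produced with trec_eval, and treating unjudged units as not relevant is an assumption of this example.

```python
from typing import Dict, List

SCORE = {"R": 2, "PR": 1, "N": 0}


def binarize(judgments: List[str], threshold: float) -> bool:
    """Average the assessors' scores and compare against the threshold."""
    average = sum(SCORE[j] for j in judgments) / len(judgments)
    return average >= threshold


def precision_at_k(ranking: List[str], relevant: Dict[str, bool], k: int) -> float:
    """Fraction of the top-k results that are relevant (unjudged = not relevant)."""
    hits = sum(1 for unit in ranking[:k] if relevant.get(unit, False))
    return hits / k


# Hypothetical judgments by two assessors per retrieval unit.
judged = {"d1": ["R", "R"], "d2": ["R", "PR"], "d3": ["PR", "N"], "d4": ["N", "N"]}
strict = {unit: binarize(js, 1.5) for unit, js in judged.items()}
partial = {unit: binarize(js, 0.5) for unit, js in judged.items()}

ranking = ["d1", "d3", "d4", "d2", "d9"]
print(precision_at_k(ranking, strict, 5), precision_at_k(ranking, partial, 5))
```

The unit d3, with an average score of 0.5, counts as relevant only under the "partial relevance" setting, which is why the two precision values differ.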


12.4.2 MIR Systems 


The numbers of participating teams were 6, 8, and 6 for the NTCIR-10, 11, and 12 Math Tasks, respectively.
Three teams participated in all three tasks. For NTCIR-11 and 12, there were one
or two new participating teams. The architectures of the participating systems were
quite diverse. For formula encodings, each of the LaTeX, MathML Presentation Markup, and
MathML Content Markup formats was used by at least one system; Presentation
Markup was the most popular notation. Also, the majority of systems used a general-purpose
search engine for indexing.

The following common technical decisions should be considered in designing 
MIR systems. 


12.4.2.1 How to Index Math Formulae? 


Mathematical formulae are expressed as XML tree structures, which often become
very complex. In addition, the search sometimes requires approximate matching to
allow a certain flexibility. There are two strategies for indexing math formulae:
token-based and subtree-based. While token-based indexing treats math
tokens like words in a text, subtree-based indexing decomposes the XML
structure into smaller fragments, i.e., subtrees, and treats them as indexing units. In the
NTCIR Math Tasks, the majority of systems took into account structural information
for formulae.
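The difference between the two indexing strategies can be seen in a toy inverted index. In the sketch below, formulae are nested tuples rather than MathML, and the function names and the serialization of subtrees are assumptions chosen for brevity; real systems differ in how they normalize and decompose trees.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple, Union

# An expression is a symbol (str) or a tuple (operator, arg1, arg2, ...).
Expr = Union[str, Tuple]


def tokens(expr: Expr) -> List[str]:
    """Token-based units: every operator and symbol, like words in a text."""
    if isinstance(expr, str):
        return [expr]
    return [tok for part in expr for tok in tokens(part)]


def serialize(expr: Expr) -> str:
    """Canonical string for a subtree, e.g. plus(x,y)."""
    if isinstance(expr, str):
        return expr
    return f"{expr[0]}({','.join(serialize(arg) for arg in expr[1:])})"


def subtrees(expr: Expr) -> List[str]:
    """Subtree-based units: the serialized form of every non-leaf subtree."""
    if isinstance(expr, str):
        return []
    units = [serialize(expr)]
    for arg in expr[1:]:
        units.extend(subtrees(arg))
    return units


def build_index(formulae: Dict[str, Expr],
                extract: Callable[[Expr], List[str]]) -> Dict[str, Set[str]]:
    """Inverted index from indexing unit to formula IDs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for formula_id, expr in formulae.items():
        for unit in extract(expr):
            index[unit].add(formula_id)
    return dict(index)


# Two hypothetical formulae: x^2 + y and y + x^2.
formulae = {
    "f1": ("plus", ("power", "x", "2"), "y"),
    "f2": ("plus", "y", ("power", "x", "2")),
}
print(build_index(formulae, tokens))
print(build_index(formulae, subtrees))
```

The token index cannot tell the two formulae apart, whereas the subtree index shares the unit power(x,2) between them but keeps their top-level structures distinct, which illustrates the extra structural information that subtree-based indexing preserves.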
12.4.2.2 How to Deal with Query Variables?


One of the prominent features of MIR is that a query formula can contain “variables”,
i.e., symbols that can serve as named wildcards. Since the unification operation is
expensive, most participating systems used a re-ranking step, wherein one or more
initial rankings are merged and/or reordered. This approach of obtaining an initial
candidate ranking followed by a refined ranking is a common and effective strategy.
To locate strong partial matches, all the automated systems used unification, whether
for variables (e.g., “x² + y² = z²” unifies with “a² + b² = c²”), constants, or entire
subexpressions (e.g., via structural unification or indirectly through generalized terms
with wildcards for operator arguments).


12.4.2.3 Other Technical Decisions 


Other issues include how to identify the importance of the keywords/math formulae 
in queries and documents; exploit context information; normalize math formulae 
with possibly many notation variations; deal with ambiguity in the original LaTeX
notation; combine keyword-based search with math formula search; and deal with
“simto”-type queries. To summarize, there can be many options for MIR system
design, and they should be balanced with computation cost. 


12.5 Further Trials 


The NTCIR Math Tasks also contain several important trials that lead to further 
exploration in succeeding research, as detailed below. 


12.5.1 ArXiv Free-Form Query Search at NTCIR-10 


The NTCIR-10 Math Pilot Task contained 19 open queries from mathematicians 
expressed as free descriptions with natural language text and formulae. Here is an 
example (NTCIR10-OMIR-19): 


Let Xn be a decreasing sequence of nonempty closed sets in a Banach space such that their 
diameter tends to 0. Is their intersection nonempty? 


These topics were collected from questions asked by mathematicians in related 
forums, which makes the task settings more realistic and general. Since convert- 
ing the textual descriptions into “keyword+formula” queries requires deep natural 
language comprehension, we did not pursue this direction further in this task. How- 
ever, real queries in forums are an important resource for analyzing user information 
needs in their retrieval (Mansouri et al. 2019; Stathopoulos and Teufel 2015). 

The Answer Retrieval for Questions on Math (ARQMath) is a newly launched
task for the 11th Conference and Labs of the Evaluation Forum (CLEF 2020).⁷ Data
from Math Stack Exchange,⁸ a mathematics-dedicated question answering forum,
are expected to be used for ARQMath. Such explorations are expected to give further
insights into realistic information needs.


12.5.2 Wikipedia Formula Search at NTCIR-11 


The NTCIR-11 Math-2 Task provided the first open platform for comparing formula 
search engines, based upon their ability to retrieve specific formulae in Wikipedia
articles (Schubotz et al. 2015). By using formula-only queries that require an exact
match of the math tree structure, the platform enables automatic evaluation without
any human intervention. Despite the simplicity of the task, the automatic evaluation
framework was useful in verifying and tuning the formula search function of
math search engines. This will enable us to establish leaderboard-style comparison 
of different strategies for complicated large-scale formula searches. 


12.5.3 Math Understanding Subtask at NTCIR-10 


The goal of the Math Understanding Subtask was to extract natural language defini- 
tions of mathematical formulae in a document for their semantic interpretation. The 
dataset for this subtask contains 10 manually annotated articles used in a dry run and 
an additional 35 used in a formal run. 

A description is obtained from a continuous text region or concatenation of some 
discontinuous text regions. Shorter descriptions may also be obtained from a longer 
one. For instance, in the text “log(x) is a function that computes the natural logarithm
of the value x”, the complete description of “log(x)” is “a function that computes 
the natural logarithm of the value x”. Moreover, the shorter descriptions “a function” 
and “a function that computes the natural logarithm” can be obtained from the pre- 
vious one. This corpus defines two types of possible descriptions of mathematical 
expressions, namely full description (contains the complete type) and short descrip- 
tion (contains the short type). Participants could extract any type of description in 
their submission. 

⁷ https://www.cs.rit.edu/~dprl/ARQMath/.
⁸ https://math.stackexchange.com/.

The training and test sets consist of 35 and 10 annotated papers, respectively, selected
from the arXiv corpus. Inter-annotator agreement was tested on five papers
taken from the corpus. Three measures were used to test the reliability of the
annotation: F1-score, Cohen’s kappa, and Krippendorff’s alpha. To compute the F1-score,
the positions of the annotated descriptions from the two annotators were strictly matched.
The F1-score was 0.8670, Cohen’s kappa was 0.8993, and Krippendorff’s alpha was
0.7630 for full descriptions, and the F1-score was 0.9014 for full and short descriptions.
The evaluation was conducted by matching the positions of the extracted descriptions
against the positions of gold-standard descriptions, and precision, recall, and F1-score
were used.
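
The position-based measures above can be sketched as follows. The representation of a description as a (start, end) character span and the use of per-token binary labels for Cohen's kappa are assumptions made for illustration; the chapter does not specify how the items for kappa were defined, and the spans and labels below are invented toy data.

```python
def strict_prf(gold, predicted):
    """Precision, recall and F1 with strict matching of (start, end) positions."""
    gold, predicted = set(gold), set(predicted)
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical spans: the second description differs in its end position,
# so it does not count as a match under strict matching.
gold_spans = [(0, 5), (10, 20), (30, 42)]
system_spans = [(0, 5), (10, 18), (30, 42)]
precision, recall, f1 = strict_prf(gold_spans, system_spans)

# Hypothetical per-token labels (1 = inside a description) from two annotators.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Strict matching penalizes a boundary disagreement as both a false positive and a false negative, which is worth keeping in mind when comparing the reported scores with those from softer, overlap-based matching.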

Math-description extraction is considered important to combine mathematical 
formulae with their textual descriptions for their interpretation. For example, Kris- 
tianto et al. (2017) combined the description extraction with formula dependency 
extraction and obtained consistent improvement in the Math Retrieval Subtasks in 
the succeeding NTCIR Math Tasks. 


12.6 Further Impact of NTCIR Math Tasks 


Several years after these NTCIR Math Tasks, we witnessed a number of valuable 
developments in mathematical content access studies. This section provides a brief 
introduction to some of these activities, although it is far from comprehensive.


12.6.1 Math Information Retrieval 


Since these NTCIR Math Tasks, increasing attention has been paid to semantic 
retrieval of mathematical formulae. NLP techniques often play a critical role in 
bridging the gap between presentation and semantic representations of math formu- 
lae. Recent studies on this topic include variable typing (Stathopoulos et al. 2018), 
using the textual context for transformation from a presentation level to semantic 
level (Schubotz et al. 2018), and identifying declarations of mathematical objects 
(Lin et al. 2019). 

Overall, there are several valuable approaches to MIR, including those we could 
not introduce in this book chapter. According to the number of citations on Semantic
Scholar,⁹ the overview papers of the Math Tasks during NTCIR-10, 11, and 12 have
39, 39, and 33 citations, respectively, as of December 2019. MIR is also characterized
by the diversity of the conferences and journals of the related papers, including such 
fields as mathematics, information retrieval, image recognition, NLP, knowledge 
management, and document processing. 


⁹ https://www.semanticscholar.org.


12.6.2 Semantics Extraction in Mathematical Documents


Noteworthy recent work includes a general-purpose part-of-math tagger that per- 
forms semantic disambiguation and parsing of math formulae (Youssef 2017) and 
embeddings of math symbols (Mansouri et al. 2019; Youssef and Miller 2019). It has 
also been reported that image-based math-formula search is capable of capturing
semantic similarity without unification (Davila et al. 2019). Other related topics that 
were not addressed during the NTCIR Math Tasks include math document catego- 
rization (Barthel et al. 2013) using formulae information (Suzuki and Fujii 2017). 


12.6.3 Corpora for Math Linguistics 


The development work for the arXiv corpus (and the subsequent requests by the 
community) made it very clear that work on document understanding and information 
in Mathematics and STEM can only succeed based on large and shared document 
corpora. A single conversion run over the arXiv corpus (over 1.5 million documents)
is a multi-processor-year enterprise generating 10^8 to 10^9 error reports in gigabytes
of log files. 

To support and manage this computational task, the CorTeX system¹⁰ has been
developed as a general-purpose processing framework for corpora of scientific documents.
The licensing issues involved in distributing the ensuing corpora have led to
the recent establishment of the Special Interest Group for Math Linguistics (SIGMathLing),¹¹
a forum and resource cooperative for the linguistics of mathematical and
technical documents. The problem is that many of the mathematical corpora (e.g.,
the arXiv corpus or the 3 million abstracts of zbMATH¹²) are not available under
a license that allows republishing. While the copyright owners are open towards
research, they cannot afford to make the corpora public. SIGMathLing hosts such
data sets in a corpus cooperative: researchers in mathematical semantics extraction
and information retrieval sign a cooperative non-disclosure agreement, get access
to the data sets, and can deposit derived data sets in the cooperative. Data sets have
dedicated landing pages so that they can be cited. A prime example of a data set is
the XHTML5+MathML version of the arXiv corpus up to August 2019.¹³


¹⁰ https://github.com/dginev/CorTeX.
¹¹ https://sigmathling.kwarc.info/.
¹² http://zbmath.org.
¹³ The landing page is at https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/.


12.7 Conclusion 


The NTCIR Math Tasks were an initial attempt to facilitate the formation of an
interdisciplinary community of researchers interested in the challenging problems
underlying MIR. The diversity of approaches reported at NTCIR shows that research
in this field is active. We have witnessed the progress of participating systems since the
NTCIR-10 Pilot Task, with improved scalability and new approaches to result
ranking.

The design decision of the arXiv subtask to exclusively concentrate on formula/keyword
queries and use paragraphs as retrieval units made the retrieval task
manageable but has also focused research away from questions such as result pre- 
sentation and user interaction. In particular, few systems have invested in further 
semantics extraction from a corpus and used that in the search process to further 
address information needs. We feel that this direction should be further addressed in 
future tasks. 

Ultimately, the success of MIR systems will be determined by how well they are 
able to accommodate user needs in terms of the adequacy of the query language, trade- 
off between query language expressiveness/flexibility, and answer latency on the one 
hand and learnability on the other. Similarly, the result ranking and monetization 
strategies for MIR are still a largely uncharted territory; we hope that future MIR 
tasks can help make progress on this front. 


Abstract Lifelogging can be described as the process by which individuals use various
software and hardware devices to gather large archives of multimodal personal
data from multiple sources and store them in a personal data archive, called a lifelog.
The Lifelog task at NTCIR was a comparative benchmarking exercise with the aim
of encouraging research into the organisation and retrieval of data from multimodal
lifelogs. The Lifelog task ran for over 4 years from NTCIR-12 until NTCIR-14
(2015.02–2019.06); participants could submit to five subtasks, each tackling
a different challenge related to lifelog retrieval. In this chapter, a motivation is
given for the Lifelog task and a review of progress since NTCIR-12 is presented.
Finally, the lessons learned and challenges within the domain of lifelog retrieval are
presented.


13.1 Introduction


Recent advances in computing technology and wearable sensors mean that individ- 
uals are now in a position to log data about their lives on a continual basis, with little 
manual effort required. These data logs are often called lifelogs, and the process of 
gathering them is referred to as lifelogging. Lifelogging typically occurs in a passive 
manner (i.e. using sensors and not relying on human input). A commonly used defini- 
tion of lifelogging is as ‘a form of pervasive computing, consisting of a unified digital 
record of the totality of an individual’s experiences, captured multimodally through 
digital sensors and stored permanently as a personal multimedia archive’ (Dodge 
and Kitchin 2007). Lifelogging can generate enormous (potentially multi-decade) 
archives that are too large for manual organisation. What sets lifelogging apart from 
conventional personal data organisation challenges (e.g. photos or emails) is the fact 
that lifelogs, being captured passively, are typically continuous in nature and are non- 
curated archives. Hence, these lifelogs pose a significant challenge for researchers 
to develop appropriate information organisation and retrieval approaches. 

In the past 15 years, lifelogging has been receiving increasing research attention 
across a range of domains, including multimedia analytics, event-based computing, 
pervasive computing, information retrieval, as well as various application domains 
such as memory-science, wellness and epidemiological studies. A detailed overview 
of the early research activities on lifelogging is provided by Gurrin et al. (2014b),
and we refer the reader to that overview. Prior to NTCIR-12, there was no forum that
could support a comparative evaluation of approaches to lifelog data organisation
and retrieval, nor were there suitable datasets, nor was there even consensus on which
of the many potential research challenges were the most important. The Lifelog 
task at NTCIR-12 was proposed because the organisers identified that lifelogging 
had potential to become a relatively commonplace activity, thereby necessitating the 
development of new approaches to multimodal personal data analytics and retrieval. 
New personal sensors were coming to market, such as wearable cameras, AR glasses, 
various forms of fitness trackers and so on. These were generating large multimodal 
archives for individuals yet, as with many new technologies, the required organisation 
tools had not been considered. It is the belief of the organisers that such vast archives 
of personal data require search and automatic annotation as fundamental underlying 
technologies upon which various other applications can be built; hence, the Lifelog 
task was proposed. 


13.2 Related Activities 


Lifelog data has been used in many domains as a source of multimodal data log- 
ging the real-world activities of one, or more, individuals. From prior research, we 
note that lifelogging tools were applied in the domains of long-term memory under- 
standing (Milton et al. 2011), supporting human recollection (Barnard et al. 2011), 
supporting human memory (Berry et al. 2007; Harvey et al. 2016), facilitating large-scale
epidemiological studies in healthcare (Signal et al. 2017), lifestyle monitoring
at the individual level (Nguyen et al. 2016; Wilson et al. 2018), behaviour analytics 
(Everson et al. 2019), diet/obesity analytics (Zhou et al. 2019), or for exploring soci- 
etal issues such as privacy-related concerns (Hoyle et al. 2014). For many of these 
domains of application, the lifelog data was gathered and analysed by humans in 
order to draw conclusions for their research tasks. 

In terms of actual functional retrieval systems for lifelog data, a number of early 
retrieval engines had been developed prior to NTCIR-12, such as the MyLifeBits 
system (Gemmell et al. 2002) or the Sensecam Browser (Lee et al. 2008). Both 
of these systems were browsing engines, rather than search engines, and relied on 
a database metaphor to support access. Subsequently, it was found that a faceted- 
multimodal search engine (even a simple one) was many times faster and more 
effective than browsing systems at finding known items from large lifelogs (Doherty 
et al. 2012), yet there were few search engines designed for lifelog data and no 
means of comparing their effectiveness. This means that prior to the Lifelog task at
NTCIR-12, there were no comparative benchmarking activities, and comparative and
reproducible research on lifelogging was rather sparse. The main reason for this was
the lack of publicly available lifelog datasets, which was due to the highly personal 
nature of lifelog data and the related requirement to guarantee people’s privacy when 
releasing such datasets for widespread use. 

The NTCIR-12 Lifelog pilot task (Gurrin et al. 2016) introduced the first shared 
test collection for lifelog data and attracted the first cohort of participants to, what 
was at the time, a very novel and challenging task. Since this initiative at NTCIR- 
12, there have been two related activities at alternative venues; one at ImageCLEF 
(Dang-Nguyen et al. 2017a, 2018) which focused on a series of image-retrieval and 
summarisation focused benchmarking initiatives since 2017, and the Lifelog Search 
Challenge (LSC) (Gurrin et al. 2019b) which was modelled on the successful Video 
Browser Showdown (Lokoc et al. 2018). The LSC encourages participants to develop 
interactive search engines for lifelog data and evaluate them in a public forum. The 
LSC has run at the annual ACM ICMR conference since 2018. 

Specifically in relation to standalone retrieval efforts, early research on lifelog 
retrieval has focused on using images as the unit of retrieval (e.g. Lee et al. 2008) with
some early work in supporting user browsing of these image collections (Doherty et al.
2011), or on the use of maps metadata, such as GPS locations, to organise content
visually (Chowdhury et al. 2016). Once again, we refer the reader to Gurrin et al.
(2014b) for an overview of early efforts at lifelog search and retrieval. Significant
efforts also went into the development of graphical user interfaces to visualise the 
data and also provide a positive user experience. Many good examples of interactive 
interfaces can be seen in the systems developed for the interactive Lifelog Search 
Challenge since 2018. 


13.3 Lifelog Datasets Released at NTCIR


Over the course of the three most recent NTCIR workshops, the Lifelog task intro- 
duced three new datasets. The datasets were developed to represent a multimodal 
digital surrogate of the life activities of a number of individuals as they go about their 
daily lives, over an extended period of time (weeks or months). These datasets rep- 
resented unprecedented data-rich archives for a number of individuals, pushing the 
boundaries of what was feasible to collect and distribute in an ethically and legally 
acceptable manner. Each dataset was gathered by either two or three lifeloggers, who 
wore/carried with them various lifelogging devices and gathered activity/biometric 
data for most (or all) of the waking hours in the day. The three datasets contained 
images from passive-capture wearable cameras as the core of each dataset. The 
passive-capture wearable camera was either clipped to clothing or worn on a lanyard
around the neck, which captured images (from the wearer’s viewpoint) and
operated for 12–14 h per day (1,250–4,500 images per day, depending on capture
frequency, camera type, or length of waking day). For examples of images captured
by such wearable cameras, see Fig. 13.1. Additionally, mobile phone apps gathered
contextual data such as location or physical movements and additional sensors (e.g. 
smartwatches or biometric-testing sensors) provided health and wellness data. 


Fig. 13.1 Examples of Wearable Camera Images (Narrative Clip from NTCIR-13) 


13 Experiments in Lifelog Organisation and Retrieval at NTCIR 191 


Typically, the datasets consist of: 


• Multimedia Content: Wearable camera images captured at a rate of about two
images per minute, with the camera worn from breakfast to sleep. Accompanying this image
data for NTCIR-13/14 was a time-stamped record of music listening activities
sourced from Last.FM¹ and (for NTCIR-14) an archive of all conventional
(active-capture) digital photos taken by the lifelogger.

• Biometrics Data: Using off-the-shelf fitness trackers,² the lifeloggers gathered
24 × 7 heart rate, caloric burn and steps. In addition, for NTCIR-14, continuous
blood glucose monitoring was added, which captured readings every 15 min using
the Freestyle Libre wearable sensor.³

• Human Activity Data: The daily activities of the lifeloggers were captured in
terms of the semantic locations visited, physical activities (e.g. walking, running,
standing) from the Moves app,⁴ along with (for NTCIR-14) a time-stamped diet
log of all food and drink consumed.

• Enhancements to the Data: The wearable camera images were annotated with
the outputs of various visual concept detectors which described in textual form the
content of the lifelog images.


Readers who are interested in more information on the three lifelog datasets are 
referred to the task overview papers for NTCIR-12 (Gurrin et al. 2016), NTCIR- 
13 (Gurrin et al. 2017) and NTCIR-14 (Gurrin et al. 2019a). See Table 13.1 for a 
summary comparison of the three datasets. 

What makes lifelog dataset generation a challenging task is the personal nature 
of real lifelog data (Chaudhari et al. 2007; Dang-Nguyen et al. 2017b) which must 
be gathered and released in a carefully organised process. One, or more, individuals 
must be willing to share a digital representation of their real-world activities with 
both researchers and the community. Aside from the difficulties of finding lifeloggers 
willing to share, various legal and institutional requirements needed to be met, such 
as passing review by an institutional ethics board, and for NTCIR-14, the preparation
of a Data Protection Impact Assessment (to meet European GDPR requirements).
Datasets were made available via the NTCIR-Lifelog website⁵ and were password
protected and secured by HTACCESS with username/password pairs generated for 
each participant. Additionally, in a style similar to TREC, each participating organi- 
sation needed an appropriate representative to sign an organisational agreement form 
and send it to the task organiser. Individual agreement forms were maintained by the 
participating organisation on behalf of each task participant within that organisation. 

¹ Last.FM Music Tracker: https://www.last.fm/.
² For example, the Fitbit Fitness Tracker (FitBit Versa): https://www.fitbit.com/.
³ Freestyle Libre wearable glucose monitor: https://www.freestylelibre.ie/.
⁴ Moves App for Android and iOS: http://www.moves-app.com/.
⁵ NTCIR-Lifelog website: http://ntcir-lifelog.computing.dcu.ie/.


Table 13.1 Statistics of NTCIR lifelog datasets

Criteria               NTCIR-12        NTCIR-13         NTCIR-14
Number of Lifeloggers  3               2                2
Number of Days         90 days         90 days          43 days
Collection Size        18 GB           26 GB            14 GB
Number of Images       88,124 images   114,547 images   81,474 images
Number of Locations    130 locations   138 locations    61 locations
Physical Activities    Moves app       Moves app        Moves app
Calorie Burn           -               Fitness Watch    Fitness Watch
Step Count             -               Fitness Watch    Fitness Watch
Heart Rate             -               Chest Strap      Fitness Watch
Blood Glucose          -               Daily            Continuous
Music Listening        -               Last.FM          Last.FM
Cholesterol            -               Weekly           -
Uric Acid              -               Weekly           -
Diet Log               -               Manual           Manual
Conventional Photos    -               -                Smartphone


Prior to release, each dataset was subject to a detailed multi-phase redaction
process to anonymise the dataset in terms of the lifelogger’s identity as well as the
identity of bystanders in the data. While many approaches have been proposed for
supporting privacy preservation in lifelog data (Gurrin et al. 2014a; Memon and
Tanaka 2014), it was realised that none were effective enough to be deployed in an
automated manner over lifelog data. Hence, a multi-step process was put in place
that relied on manual (or semi-manual) redaction, and is summarised as follows:


• Data Filtering: Given the personal nature of lifelog data, it was necessary to allow
the lifeloggers to remove any lifelog data that they may have been unwilling to
share. This sharable data was then reviewed by a trusted member of the organising
team and further deletions occurred where deemed prudent.

• Privacy Protection: Privacy-by-design (Cavoukian 2010) was a requirement for
the test collection. Consequently, faces, readable screens and personal details (e.g.
bank cards, passports) were blurred in either a fully manual or semi-automated
process. Additionally, every image was resized down to 1024 × 768 resolution
which had the effect of rendering most textual content illegible. Following this, a
validation check was performed on the redaction outputs.


The overall data redaction and release process is summarised in Fig. 13.2, which 
shows the steps taken by the lifelogger (1), the organisers (2) and the responsibility 
on the task participants (3) who use the data for their experiments. As can be seen, 
the lifelogger gets the opportunity to review, filter and clean their data before the 
organisers carry out a secondary data review and cleaning, followed by the execution 
of a number of processes to ensure privacy of individuals associated with the dataset, 
followed by a final validation of the data before it is released for interested researchers 
who sign up to access the data. 


[Figure: flow from the Shared Lifelog through (1) Lifelogger Review & Clean, (2) Review by Organisers (1. Researcher Review and Clean, 2. Resize (for text), 3. Anonymisation, 4. Validate), and (3) Release as Dataset with access controls]

Fig. 13.2 Overview of the Redaction Process for the NTCIR Collections


13.4 Lifelog Subtasks at NTCIR 


Based on the use cases described previously and guided by the human memory- 
access applications of Sellen and Whittaker (2010), five different challenges were 
explored at NTCIR-Lifelog. In this section, we focus on the two main subtasks that
ran for all three Lifelog instances and we briefly describe the other three subtasks. 


13.4.1 Lifelog Semantic Access Subtask 


The Lifelog Semantic Access subtask (LSAT) was the core task of the three editions
of the Lifelog task. The aim of the task was to explore ad hoc search and retrieval
from lifelogs, which the organisers believe to be a fundamental enabling technology
to make lifelogs a useful tool for individuals. In this subtask, the participants were
required to retrieve a number of specific moments in a lifelogger’s life in response to
a topic description, as shown in Fig. 13.3. There were either 24 or 48 topics prepared
for each instance of the task. For the purposes of evaluation, the organisers made the
simplifying assumption that an image (point-in-time) is an appropriate document
for retrieval. The task can best be compared to a known-item search task with one
(or more) relevant items per topic. Evaluation was by means of standard evaluation
measures, calculated using trec_eval.⁶ For NTCIR-12 and NTCIR-13, full relevance
judgements were prepared, but for NTCIR-14, pooled relevance judgements were
used. Participants were allowed to undertake the LSAT subtask in an interactive or
automatic manner. For interactive submissions, a maximum of five minutes of search
time was allowed per topic.

⁶ https://trec.nist.gov/trec_eval/.


TITLE: Icecream by the Sea
DESCRIPTION: Find the moment when u1 was eating icecream beside the sea.
NARRATIVE: To be relevant, the moment must show both the icecream with cone in the hand
of u1 as well as the sea clearly visible. Any moments by the sea, or eating an icecream which
do not occur together are not considered to be relevant.

EXAMPLES OF RELEVANT MOMENTS FOUND BY PARTICIPANTS

Fig. 13.3 LSAT Topic Example, including example results
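
As an illustration of the kind of standard measure computed for LSAT runs, the following minimal sketch calculates average precision for one topic and its mean over topics. It is a stand-alone reimplementation for explanatory purposes, not trec_eval itself, and the image identifiers and judgements are invented.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)

def mean_average_precision(runs, judgements):
    """Mean of the per-topic average precision values."""
    scores = [average_precision(runs[t], judgements[t]) for t in judgements]
    return sum(scores) / len(scores)


# Hypothetical submission and relevance judgements for two topics.
runs = {
    "topic1": ["img3", "img7", "img1", "img9"],
    "topic2": ["img4", "img2"],
}
judgements = {
    "topic1": {"img3", "img1", "img5"},
    "topic2": {"img2"},
}
ap_topic1 = average_precision(runs["topic1"], judgements["topic1"])
map_score = mean_average_precision(runs, judgements)
```

With pooled rather than full judgements, as at NTCIR-14, unjudged images are simply absent from the relevant set, so they count as non-relevant in a computation of this kind.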

Over the three instances of the LSAT Task, we note that task participants took 
many different approaches to the development of retrieval systems. Given that there 
are no standardised baselines that can be applied, this is not surprising. Participating 
teams developed many different experimental systems, both interactive and automatic 
in nature. We look firstly at interactive retrieval engines over the three editions of 
NTCIR. At NTCIR-12, the participating team from University of Barcelona (Spain) 
developed the only interactive retrieval engine that integrated a semantic-content 
tagging tool to enhance the quality of the annotations (de Oliveira Barra et al. 2016). 
At NTCIR-13, the DCU team (Ireland) employed a human-in-the-loop to translate 
the provided queries into system queries for their retrieval engine, in one of their runs 
(Duane et al. 2017). However, at NTCIR-14, we note that three of the participants 
developed interactive systems and a fourth participant also integrated the human-in- 
the-loop query enhancement. NTU (Taiwan) developed an interactive lifelog retrieval 
system that automatically suggested to the user a list of candidate query words and 
adopted a probabilistic relevance-based ranking function for retrieval (Fu et al. 2019). 
They enhanced the official concept annotations and pre-processed the visual content 
to remove poor quality images and to offset the fish-eye nature of the wearable camera 
data. DCU (Ireland) developed an interactive retrieval engine for lifelog data (Ninh 
et al. 2019) that was designed for novice users and relied on an extensive list of facet 
filters over provided metadata. Finally, the VNU-HCM (Vietnam) group developed 
an interactive retrieval system (Nguyen et al. 2019) that used enhanced metadata 
and visual enrichment, sometimes including human annotations. Their scalable and
user-friendly interface to this system significantly outperformed competing systems 
at NTCIR-14, due primarily to the enhanced annotations. As expected, all interactive 
runs significantly outperformed the automatic runs at each edition of NTCIR-Lifelog. 

In terms of approaches to automatic retrieval, at NTCIR-12, the VTIR (USA) 
team hypothesised that location was a very important component in the information 
retrieval process (Xia et al. 2016), and thus enhanced location semantic descriptions 
were used with the BM25 retrieval model. The authors comment that this approach 
worked well for some of the topics, which were location dependent. The IDEAS Insti- 
tute for Information Industry (Taiwan) took a textual approach to retrieval (Lin et al. 
2016) utilising word2vec to better match visual concepts to user queries (an approach 
referred to as bridging the lexical gap) via query expansion. The QUT group took 
an approach to retrieval that generated long, descriptive paragraphs of text to anno- 
tate the lifelog content, as opposed to the conventional tag-based approach (Scells 
et al. 2016); however, this was not shown to be successful. Finally, the LIG-MRM
group (France) performed significantly better than all other approaches at NTCIR-12,
by focusing on enhancing the performance of the visual concept detectors to be used 
for retrieval, and not relying on the provided (Caffe) classifier output (Safadi et al. 
2016). The Caffe classifier provides a modifiable framework for state-of-the-art deep 
learning algorithms and a collection of reference models (Jia et al. 2014). 
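
Several of the automatic runs relied on BM25 over textual annotations. The sketch below shows one common formulation of BM25 applied to toy annotation lists; the parameter values (k1 = 1.2, b = 0.75), the non-negative IDF variant, and the example annotations are assumptions for illustration and are not taken from any participant's system.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document (a list of terms) against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical annotations for three lifelog images: visual concepts
# combined with a semantic location description.
annotations = [
    ["icecream", "cone", "sea", "beach"],
    ["office", "laptop", "desk"],
    ["sea", "boat"],
]
scores = bm25_scores(["icecream", "sea"], annotations)
```

In this toy collection the first image, which matches both query terms, receives the highest score, and the image with no matching annotation scores zero. Enriching the location descriptions, as VTIR did, would correspond to adding terms to these lists before indexing.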

At NTCIR-13, three participating groups took part in the LSAT subtask in an automated
manner. DCU (Ireland) took part with their baseline search engine (Duane et al.
2017) that indexed the provided metadata and concepts using BM25 as the retrieval 
model, with both automated query runs and human-enhanced query runs. VCI2R 
(Singapore) proposed a general framework to bridge the semantic gap between lifelog 
data and the event-based LSAT topics (Lin et al. 2017) by enhancing the visual anno- 
tations and employing temporal smoothing of annotations, which proved to be the 
most successful approach at NTCIR-13. Finally, the PGB group (Japan) focused on 
the image and location data and enhanced the visual annotations (including people 
counting) and indexed locations using point-stay detection (D-Star algorithm) and 
integrated important location detection using the DBSCAN algorithm (Yamamoto 
et al. 2017). It performed better than the baseline, but not as well as the VCI2R run or 
the human-in-the-loop run by DCU. 
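
The chapter does not say how the temporal smoothing of annotations was implemented. One simple way to picture the idea is a centred moving average over the time-ordered confidence scores of a single concept, so that an isolated low (or high) score is pulled towards its neighbours. The sketch below makes that assumption; the function name, window size and scores are hypothetical.

```python
def smooth_scores(scores, window=1):
    """Centred moving average over time-ordered concept confidences."""
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - window)
        hi = min(len(scores), i + window + 1)
        neighbours = scores[lo:hi]  # shorter at the ends of the sequence
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Hypothetical detector confidences for "eating" over six consecutive images.
eating = [0.9, 0.8, 0.1, 0.9, 0.85, 0.2]
smoothed = smooth_scores(eating, window=1)
```

Here the third image, which the detector scored at 0.1 in the middle of a run of high scores, is raised to the mean of its neighbourhood.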

At NTCIR-14, NTU (Taiwan) submitted both interactive and automatic runs, and 
their automatic run (the top-ranked automatic run) included a query enhancement 
process using the top 10 nearest concepts to the query terms to expand the query 
before submitting the query (Fu et al. 2019). QUIK (Japan) from Kyushu University 
integrated online visual WWW content in the search process and operated based 
on an underlying assumption that a lifelog image of an activity would be similar to 
images returned from a WWW search engine for similar activities (Suzuki and Ikeda 
2019). The approach operated using only the visual content of the collection and 
used the WWW data to train a visual classifier with a convolutional neural network 
for each topic. Although an automated process, a human-in-the-loop mechanism was 
employed to filter the WWW examples. 

After NTCIR-14, the main approaches that the organisers consider to be valuable 
for lifelog access are the use of enhanced visual concept detectors to improve index- 
ing, which has been continually shown to be effective both at NTCIR and the Lifelog 
Search Challenge (Gurrin et al. 2019b), as well as the application of approaches 
to bridging the lexical gap, either via some form of index term expansion or query 
expansion. Given the interest in developing interactive systems, the Lifelog Search 
Challenge is now the main venue for the comparative benchmarking of interactive 
lifelog retrieval systems. 
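
To make the idea of bridging the lexical gap by query expansion more concrete, the following sketch expands each query term with its nearest indexed concepts in an embedding space, ranked by cosine similarity. It is a minimal illustration under stated assumptions: the three-dimensional vectors are invented for the example, whereas a real system would load pretrained embeddings such as word2vec, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_query(query_terms, concept_vectors, word_vectors, top_k=10):
    """Append the top_k indexed concepts nearest to each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in word_vectors:
            continue  # no embedding available, so the term is left as it is
        ranked = sorted(
            concept_vectors,
            key=lambda c: cosine(word_vectors[term], concept_vectors[c]),
            reverse=True,
        )
        for concept in ranked[:top_k]:
            if concept not in expanded:
                expanded.append(concept)
    return expanded


# Toy embeddings, invented purely for illustration.
word_vectors = {"espresso": [0.9, 0.1, 0.0]}
concept_vectors = {
    "coffee": [1.0, 0.0, 0.0],
    "cup": [0.8, 0.3, 0.1],
    "keyboard": [0.0, 0.1, 1.0],
}
expanded = expand_query(["espresso"], concept_vectors, word_vectors, top_k=2)
```

With these toy vectors the query term "espresso", which does not occur among the indexed concepts, is expanded with "coffee" and "cup", both of which do.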


13.4.2 Lifelog Insight Subtask 


The Lifelog Insight subtask (LIT) also ran at all three editions of NTCIR-Lifelog and 
was designed to explore knowledge mining from lifelogs, with particular application 
in epidemiological studies. The LIT subtask was exploratory in nature, and the aim of 
this subtask was to gain insights into the lifelogger’s daily life activities. It followed 
the idea of the Quantified Self movement that focuses on the visualisation of knowledge 
mined from self-tracking data to provide “self-knowledge through numbers”. 
Participants were requested to provide insights that support the lifelogger in the act of 
reflecting upon their life, facilitate filtering, or provide for efficient/effective means of 
lifelog data visualisation. The LIT subtask was not evaluated in the traditional sense, 
rather, all participants were asked to write about and bring their demonstrations or 
reflective output to the NTCIR conference. 

At NTCIR-12, the Sakai Lab at Waseda University (Japan) developed a prototype 
smartphone application called Sleepflower, which was designed to improve the sleep 
cycles of a group of users (Iijima and Sakai 2016). A flower metaphor was displayed 
on the smartphone screen to represent the current sleepiness of a particular user, 
based on a manual analysis of the habits of the lifeloggers. Participants from Toy- 
ohashi University (Japan) examined repeated pattern discovery from lifelog image 
sequences by applying a Spoken Term Discovery technique (Yamauchi and Akiba 
2016); a variant of Dynamic Time Warping was used in an experimental approach 
to extract meaningful patterns from the lifelog data. DCU (Ireland) introduced an 
interactive lifelog interrogation system which allowed for manual interrogation of 
the lifelog dataset for the occurrence of visual concepts that were assumed to match 
the information needs (Duane et al. 2016). The results of this manual interrogation 
were then used to generate insights and infographics. 
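
For readers unfamiliar with Dynamic Time Warping, the sketch below computes the classic DTW distance between two one-dimensional sequences. It is the plain textbook algorithm, not the variant used by the Toyohashi team, and the example sequences are made up; in practice each element would be a feature vector derived from a lifelog image rather than a single number.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also aligns with an earlier b
                cost[i][j - 1],      # b[j-1] also aligns with an earlier a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# The second sequence repeats a value, so it is a time-stretched copy.
same = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different = dtw_distance([1, 2, 3], [5, 5, 5])
```

Because DTW allows one element to align with several elements of the other sequence, the stretched copy is at distance zero from the original, which is what makes the technique attractive for finding repeated patterns that unfold at different speeds.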

At NTCIR-13, Tsinghua University (China) developed an approach to give 
insights into the big-five personality traits, moods, music moods, style detection 
and sleep-quality prediction (Soleimaninejadian et al. 2017). The team augmented 
the provided dataset with lifelog data gathered by other volunteers. The team found 
that their approaches achieved objective results with a high degree of accuracy, and 
noted the implications for improving traditional psychological research by employ- 
ing lifelog data. Participants from the Institute for Infocomm Research (Singapore) 
presented a method for finding insights from the lifelog data by creating a 
topic-focused minute-by-minute annotation of the user’s activities (Xu et al. 2017). This 
was achieved by applying deep learning approaches for image analytics and then 
fusing the multimodal sensor data to generate insights into patterns and associations 
between lifelogger activities. The team from DCU (Ireland) introduced a new inter- 
active lifelog interrogation system which was implemented for access in a Virtual 
Reality Environment (Duane et al. 2017). The system was designed to allow a user 
to explore visual lifelog data in an interactive and highly visual manner. Finally, the 
PGB group (Japan) developed an approach to automatically label the lifelog images 
with 15 concept labels (Yamamoto et al. 2017) using a DNN model with a fusion 
layer of tri-modal data (image, location and biometric). 

At NTCIR-14, only one group took part in the LIT subtask. THUIR (China) 
developed a number of detectors for the lifelog data to automatically identify and 
visualise the status/context of a user (Nguyen et al. 2019) and a comparison between 
the various approaches showed that the visual features were significantly better than 
non-visual (metadata) features. 


13.4.3 Other Subtasks (LEST, LAT and LADT) 


A number of additional exploratory subtasks were run once (or twice) only. We will 
briefly describe these and comment on why they were not run in all three instances of 
the Lifelog task. The Lifelog Event Segmentation subtask (LEST) ran at NTCIR-13, 
the aim of which was to examine approaches to event segmentation from continual 
lifelog stream data (Gurrin et al. 2017). Event segmentation had been the typical 
approach to generation of indexable and retrievable documents (events) from lifelog 
collections. Given that the definition of an event is inherently subjective to the expe- 
rience of the individual lifelogger, the organisers defined 15 types of events for the 
segmentation process, based on the 15 common lifestyle activities defined by Kah- 
neman et al. (2004). The PGB group (NTT, Japan) participated in the LEST and 
developed a number of alternative approaches to event segmentation, including temporal 
visual similarity, user-linger-points, the use of LDA to reduce dimensionality 
and identify boundaries, and a multi-feature approach that used cosine similarity 
between segments (Yamamoto et al. 2017). The user-linger-points approach proved 
to be the most successful for event segmentation. 
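
As an illustration of similarity-based event segmentation in general, the following sketch places an event boundary wherever the cosine similarity between consecutive feature vectors falls below a threshold. This is a deliberately simple stand-in and should not be read as the PGB group's method; the threshold value and the two-dimensional features are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def segment_events(features, threshold=0.8):
    """Return the indices at which a new event is assumed to start."""
    boundaries = [0] if features else []
    for i in range(1, len(features)):
        # A sharp drop in similarity suggests a change of scene or activity.
        if cosine(features[i - 1], features[i]) < threshold:
            boundaries.append(i)
    return boundaries


# Hypothetical visual features for five consecutive lifelog images.
features = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [0.05, 1.0]]
boundaries = segment_events(features, threshold=0.8)
```

The first two images are visually similar, as are the last three, so the sketch reports two events starting at indices 0 and 2.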

At NTCIR-14, the LEST morphed into the Lifelog Activity Detection subtask 
(LADT) (Gurrin et al. 2019a), which required the classification of 
the multimodal lifelog data into one or more human activities that were identified 
as occurring in the lifelog collection. The NTU group (Taiwan) developed a new 
approach for the multi-label classification of lifelog images (Fu et al. 2019). In 
order to train the classifier, the authors manually labelled 4 days, which were chosen 
because they covered most of the activities that the lifeloggers were involved in. 

However, the organisers note that there was little interest from the community in 
this task. This was surprising, since many of the previous applications of lifelog data 
to solve real-world challenges (e.g. healthcare or epidemiological studies) would 
require the detection of human activities as a fundamental building block. Perhaps 
this task will become very relevant and interesting at a later date, once lifelogging 
becomes a more commonplace activity for personal use or scientific enquiry. 

It is worth noting that one outcome of this subtask was a new pilot task at NTCIR- 
15, which has a micro-activity detection/retrieval task (called MART) that extends 
this early work by focusing on the identification of short activities of daily life (e.g. 
writing an email, making a cup of coffee) and is targeted at the generation of rich 
and detailed semantic logs of everyday activities. 

Finally, another exploratory subtask that ran at NTCIR-13 was the Lifelog Anno- 
tation subtask (LAT), which aimed to develop approaches for annotation of the mul- 
timodal lifelog data (images) with a fixed set of 15 high-level labels/concepts chosen 
from a manually generated ontology of lifelogging activities (Gurrin et al. 2017). 
These concepts were based on both the activities (facets of daily life) of the indi- 
vidual and the environmental settings (contexts) of the individual. Motivated by 
the realisation from NTCIR-12 that high-quality annotations are important for the 
retrieval process, the aim of this task was to provide various sets of high-quality shared 
annotations for all other participants to use in the LSAT subtask. However, only one group 
participated, so this annotation sharing did not occur. The PGB group (Yamamoto 
et al. 2017) developed a DNN model, with a fusion layer of tri-modal data (image, 
location and biometrics) to perform the content annotation. It was found that visual 
and biometric features can enhance the automatic annotation process, yet location 
actually was found to reduce annotation quality. Once again, this task was not attrac- 
tive to NTCIR participants, so the Lifelog Activity Detection subtask (LADT) at 
NTCIR-14 replaced it. 


13.5 Lessons Learned 


Since NTCIR-12, 18 different research groups have taken part in the Lifelog task, 
some of them multiple times and across multiple tasks. Uptake on the subtasks 
suggests that the community is interested in the retrieval challenge and, to a lesser 
extent, the insights challenge. The other three challenges have not attracted much 
interest at this point. At the end of the NTCIR-Lifelog tasks, we can identify some 
lessons learned from the three editions of the NTCIR-Lifelog task: 


• Novel Datasets: Eighteen participants submitted official runs to NTCIR, but at 
least three times as many downloaded the datasets. Even 4 years after starting 
the NTCIR-Lifelog task, requests for the datasets are still being received by the 
organisers. There is clearly an interest in the community to develop retrieval and 
analytics tools over such datasets, so there is significant potential for others in the 
community to define and release novel datasets of human life-experience data. 

• Richer metadata: Repeatedly, we have seen that the best performing retrieval systems 
enhanced the provided metadata by relying on additional visual concept 
detectors, or seeking additional sources of metadata to enhance the retrieval performance. 
There is clearly a need to develop new approaches to the creation of 
semantically rich metadata for multimodal lifelogs, in order to facilitate more 
effective retrieval algorithms. 

• Bridge the lexical gap: Many participants found that there was a lexical gap 
between the terms used by the lifeloggers in their topic descriptions, and the 
indexed textual content and annotations. This suggests a need for term or query 
expansion, and the current consideration is that this could be achieved using 
approaches such as conventional query expansion or word embedding. 

• Integrate external WWW content: This has been used by some participants with 
positive results. The external content helps to enhance the quality of content anno- 
tations or can be used as a form of query enhancement. 

• There is an observed interest in the generation of insights or knowledge from 
lifelog data, as seen by the participation in the LIT subtask. This seems best 
suited to addressing the reflection and reminiscence use case of human memory 
as outlined by Sellen and Whittaker (2010). 

• Document segmentation of the lifelog data into indexable content is as yet an 
unsolved challenge. Initial attempts at lifelog ‘event segmentation’ (Lee et al. 2008) 
generated static documents for retrieval using an early sensor-based approach to 
segmentation. As with any information retrieval system, the concept of a document 
needs to be clearly defined and understood, which is not yet the case for lifelog 
data. 

• Interactive search: Finally, interest in interactive systems has been increasing 
since NTCIR-12 and the Lifelog Search Challenge (Gurrin et al. 2019b) has been 
started to specifically explore this challenge. This appears to be the current hot 
topic for lifelog search and retrieval. 


13.5.1 Conclusions and Future Plans 


Over the course of the three instances of the NTCIR-Lifelog task, the uptake by 
participants was not as high as the organisers had hoped. One reason for this may be 
the emergence of a suite of parallel activities to motivate research into lifelogging and 
personal data analytics, such as the previously introduced interactive Lifelog Search 
Challenge (Gurrin et al. 2019b) and the ImageCLEF-Lifelog activities (Ionescu et al. 
2018). The Lifelog Search Challenge in particular has been attracting 8–10 groups 
annually who come together to partake in a real-time interactive search challenge, 
which provides an open forum for all ACM ICMR conference attendees to partake 
as either observers or even as novice users in the competition. The ImageCLEF- 
Lifelog task tends to attract researchers more focused on the computer vision aspects 
of insight generation and data organisation and as such, it is targeting a slightly 
different audience. Regardless of the reasons, the uptake of the task and the level 
of interest in the dataset, along with the other related activities, suggest a keen 
level of interest in the community for lifelog retrieval and the organisers note that 
this interest is likely to grow as volumes of personal multimodal data increase in 
society. The organisers understand that lifelog retrieval is a challenging activity, 
and the future of the Lifelog task at NTCIR is perhaps in the refinement of the 
task to address key challenges in the domain, such as privacy-aware retrieval from 
personal multimodal data, epidemiological-scale analytics studies that analyse large 
lifelogs from multiple participants, targeted healthcare tasks of interest to concerned 
individuals and medical professionals (e.g. finding medicine-taking events), or novel 
related domains such as neural data retrieval. 

It is an inevitable fact that the main challenge for any organisers of such tasks is 
the effort required to generate appropriate and real-world datasets and release them 
in an ethically and legally compliant manner. The three lifelog datasets released 
by the task organisers at NTCIR represent about a year of effort in total from a 
number of researchers and lifeloggers; this naturally incurs significant expenses in 
terms of organisers’ time and resources. Real-world use cases are likely to either 
focus on retrieval from longitudinal archives donated by one individual, or across 
large populations (as in epidemiological studies) and the data gathering and release 
methodology employed for this task was not ideal, due to the large overhead of effort 
required to ensure privacy preservation. The evaluation-as-a-service model proposed 
by Hopfgartner et al. (2020) is one potential way forward, which 
brings the algorithms to the data, rather than the conventional data-to-algorithm 
approach. Another potential next step is to encourage more comparative evaluation of 
interactive systems, since a user of a lifelog tool (either an individual or a professional 
analyst) is most likely to be using such tools in an interactive manner. In any case, 
the organisers of the NTCIR-Lifelog tasks consider that this book chapter marks the 
end-of-the-beginning of research into lifelog data organisation and retrieval, rather 
than the conclusion of a short-lived sub-topic of IR. It is our belief that lifelogging 
as a topic will continue to become more popular for IR researchers and that the 
availability of relevant datasets and challenges will increase in the coming years. 


Chapter 14 
The Future of Information Retrieval Evaluation 


Douglas W. Oard 


Abstract Looking back over the storied history of NTCIR that is recounted in this 
volume, we can see many impactful contributions. As we look at the future, we 
might then ask what points of continuity and change we might reasonably anticipate. 
Beginning that discussion is the focus of this chapter. 


14.1 Introduction 


In his book The Third Wave, Alvin Toffler placed what many have called the Infor- 
mation Age alongside the two most consequential transformations in human society, 
the introduction of agriculture, and the industrial revolution (Toffler 1980). That 
information retrieval will continue to play a central role in the coming years thus 
seems undeniable. One point of continuity between the current era and the flowering 
of science that helped to foster the industrial revolution is Lord Kelvin’s admonition 
that “if you can not measure it, you can not improve it.” Hence, the central role of 
information retrieval evaluation seems assured as well. That is not to say, however, 
that we will continue to measure our results in the same ways. Indeed, it seems 
reasonable to expect that information retrieval evaluation will continue to co-evolve 
along with changes in the information ecosystems that it serves. This chapter reflects 
on both the emergence of shared task evaluation and on present trends in information 
retrieval evaluation. 


14.2 First Things First 


Shared task evaluation arose in information retrieval from the convergence of two 
broad lines of work. The first was the test collection tradition in information retrieval 
that dates back to the early Cranfield collections of the 1960s (Cleverdon 1991). 
The central idea in a test collection is to model the behavior of a user by selecting 
some representative set of documents¹ to be searched, generating representative 
search topics, generating representative queries for those search topics, and finally 
generating relevance judgments for some useful set of query-document pairs. 

It was the need for relevance judgments that ultimately led to the creation of shared 
task evaluation for information retrieval. Many early collections were exhaustively 
judged (i.e., all query-document pairs had a relevance judgment), but as the docu- 
ment collections became larger exhaustive judgments proved to be infeasible. The 
challenge of larger collections was compounded by the emergence of search topics 
for which relatively few documents in the collection would be relevant. It was those 
topics seeking rare documents that made random sampling unsuitable as a means of 
dealing with increasing collection sizes. The approach that was ultimately adopted, 
pooling, relied on a form of purposeful sampling in which samples were drawn only 
from document sets in which existing retrieval systems had difficulty distinguishing 
between documents that were relevant and documents that were not. Ranked retrieval 
was becoming an increasingly widespread object of study at the time the idea of pool- 
ing was first tried in the Text Retrieval Conference, so this approach to sampling was 
generally operationalized as merging sets of documents that were highly ranked by 
one or more of several representative ranked retrieval systems (Voorhees and Harman 
2005). It was this need for contributions of results from a number of representative 
systems that led to the emergence of shared task information retrieval evaluation. 
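
The pooling procedure described above can be sketched in a few lines: for a given topic, the pool is the union of the top-ranked documents from each contributing run, and only the pooled documents are passed to the relevance assessors. The function name, the pool depth and the run contents below are hypothetical and serve only to illustrate the idea.

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each ranked run for one topic."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool


# Three hypothetical ranked runs for a single topic (best document first).
runs = [
    ["d3", "d1", "d7", "d9"],
    ["d1", "d4", "d3", "d8"],
    ["d5", "d1", "d2", "d6"],
]
pool = build_pool(runs, depth=2)
```

With a depth of two, only four of the nine distinct documents retrieved need to be judged. The sketch also shows why diverse contributing systems matter: a relevant document that no run ranks highly never enters the pool.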

In the movie The Right Stuff about the early American space program, one of 
the characters observes the importance of financial support with the pithy quote “No 
bucks, no Buck Rogers.” Shared task evaluation requires resources for planning and 
coordination, but most essentially for creating the relevance judgments. This side of 
the equation came from the Defense Advanced Research Projects Agency (DARPA) 
in the United States, where the voice of Lord Kelvin was strong. The competition for 
funding within DARPA was adjudicated in part using the “Heilmeier Catechism,” 
a set of questions to be answered by any new program, one of which is “What are 
the mid-term and final ‘exams’ to check for success?” DARPA had started a human 
language technology program, focusing initially on speech recognition, in 1986. 
Central to that program was a focus on evaluation. By 1990, DARPA was ready 
to expand its focus to include information retrieval. Hence was born the TIPSTER 
program, which in turn supported the early years of the Text Retrieval Conference 
(TREC). 

1 Although it is conventional to refer to documents, the term is often used inclusively to refer to 
other types of information objects as well. 

As is sometimes the case when innovating, shared task evaluation rapidly evolved 
well beyond its initial focus on measurement. TREC did indeed produce test collections. 
Importantly, those collections were shown to be reusable to a useful degree, thus 
permitting test collections developed in one year to be used in subsequent years as a 
basis for testing refinements to the system design. This approach, which came to be 
called evaluation-guided research, emerged in parallel in several research communi- 
ties (e.g., information retrieval, speech recognition, and named entity recognition). It 
would be well recognized by machine learning researchers today as an early instance 
of supervised learning (albeit one with substantial human intervention in the early 
days). A second important thing that TREC did was that it produced baseline results 
to which future results could be compared. This facilitated the entry of new research 
teams, who could compare their systems against established baselines. A third inno- 
vation was the emergence in 1996 of TREC’s more narrowly focused “tracks” to 
support specific research goals. These three innovations—collections, comparisons, 
and communities—together serve as a useful frame for examining not just shared 
task evaluation in TREC, but approaches to information retrieval evaluation more 
generally. 

Much has been written about the benefits of shared task evaluation, but when 
considering alternatives it is equally important to consider its limitations as well. 
Perhaps most obviously, shared task evaluation is expensive. For example, the cost 
of the first 18 years of TREC was calculated to be $29 million USD (Tassey et al. 
2010), which is clearly well beyond what many individual researchers could support 
on their own. Two natural results of this are that some process for making investment 
decisions is needed, and those decisions must initially be made before seeing what 
the results will be. Those facts, in turn, tend to result in multi-year commitments to 
a research program so that insights generated in one year can be capitalized upon 
in the subsequent years. As a result, shared task evaluations have a limited capacity 
to start on new lines of work. Perhaps even more importantly, the need for some 
decision process, whether centralized or consensus-based, results in there being some 
gatekeeper role beyond the individual researcher that must judge whether a broad 
line of research merits the community’s attention. Moreover, schedule considerations 
result in proposals needing to be made early—typically more than a year before the 
first results will become available. None of these limitations are show stoppers for 
research problems that require large-scale “team science” experimentation, but there 
are many settings (e.g., commercial research on problems with immediate operational 
implications, or a single student working alone on a novel problem in a 3-year Ph.D. 
program) for which shared task evaluation is not sufficiently responsive. 

A second critique of shared task evaluation is that it can generate a tendency 
toward convergence in methods, perhaps thereby delaying the exploration of impor- 
tant alternative approaches. To see an example of this, we need to look no further 
than the current widespread interest in neural “deep learning” methods. This sort of 
bursty convergence in which new techniques are rapidly explored by the community 
has benefits, but the degree of convergence that it engenders has risks as well. 
Importantly, this risk is not unique to shared task evaluations—it is simply the flip 
side of any approach in which researchers come together as a community to compare 
results in an evaluation-guided research setting. 




14.3 The Shared Task Evaluation Ecosystem 


In the two decades that followed TREC’s creation, shared task evaluation expanded 
at an impressive pace. Some notable examples (with the year in which they started) 
include the following: 


TDT (1996): The Topic Detection and Tracking (TDT) evaluation formed as a 
parallel evaluation venue to TREC to focus on streaming news content in text and 
speech (Wayne 2000). 

NTCIR (1999): The focus of this volume, NTCIR formed as a counterpart to TREC 
with a focus on East Asia. 

CLEF (2000): Initially called the Cross-Language Evaluation Forum, CLEF ini- 
tially spun out from the TREC CLIR track (Braschler and Peters 2004). 

INEX (2002): The Initiative for Evaluation of XML Retrieval (INEX) formed 
independently to focus on retrieval of structured documents, and ultimately became 
a task in CLEF (Lalmas and Tombros 2007). 

TRECVID (2003): The TREC Video Retrieval Evaluation (TRECVID) is a sepa- 
rate evaluation venue that initially spun out from the TREC Video Track (Smeaton 
et al. 2006). 

MIREX (2005): The Music Information Retrieval Evaluation eXchange (MIREX) 
implemented a large-scale infrastructure for evaluation, using algorithm deposit 
to accommodate copyright concerns (Downie et al. 2014). 

FIRE (2008): The Forum for Information Retrieval Evaluation (FIRE) has a focus 
on South Asia (Majumder et al. 2018). 

MediaEval (2010): The MediaEval Benchmarking Initiative for Multimedia Eval- 
uation initially spun out from the CLEF VideoCLEF Task (Larson et al. 2017). 


No such list could ever be complete, since shared task evaluation exists any time 
two or more research groups come together around an evaluation task. For example, 
several evaluations have been conducted in a national context, including in China, 
France, Russia, and South Korea. Moreover, the boundaries between information 
retrieval and the cognate disciplines of natural language processing and speech pro- 
cessing are porous, and there have been evaluations in those communities that cer- 
tainly bear on information retrieval research. For example, there have been evalu- 
ations of both event detection and summarization in the Text Analysis Conference 
(TAC),² and there has been evaluation of spoken term detection in the Open Keyword 
Search evaluation,³ both of which are, like TREC, organized by the National 
Institute of Standards and Technology (NIST). 

2 https://tac.nist.gov/ 
3 https://www.nist.gov/itl/iad/mig/open-keyword-search-evaluation 

All of those are TREC-like, in that they are evaluation venues independent of any 
larger event, in which participants actually come together in a workshop-like setting 
to discuss their results. There are, however, numerous additional examples in which 
one or both of those characteristics are not present. Cases in which a shared task 
evaluation is organized in conjunction with a larger event are sometimes called “data 
challenges.” The granddaddy of these data challenges was perhaps SensEval, named 
for its focus on Word Sense Disambiguation. SensEval initially formed independently 
in 1998, but then associated itself with a workshop starting in 2001 (and later changed 
its name to SemEval in 2007, reflecting its broader interest in semantics).⁴ The 
Conference on Computational Natural Language Learning (CoNLL) started a shared 
task in 1999,⁵ followed in 2001 by the Document Understanding Conference (DUC, 
which despite its name was actually a workshop series, initially held at SIGIR). 
SemEval and the CoNLL shared task continue as data workshops to this day, having 
been joined by many others (e.g., the Big Data Cup⁶); DUC ultimately became a 
standalone venue (as TAC). 

If data challenges are one step away from independent shared task evaluations 
such as NTCIR and TREC, prize-based competitions represent an even further depar- 
ture from the independent conference paradigm. Perhaps the best known members 
of this genre of shared task evaluation are Kaggle⁷ and the Netflix Prize (Bennett 
et al. 2007). The Netflix Prize started in 2006 with the goal of advancing research 
on large-scale recommender systems. Netflix, a provider of streaming video services, 
offered participants access to a large collection of anonymized usage data, offering a 
$1 million USD reward for achieving a 10% improvement over the company’s best 
current algorithm. Kaggle was founded in 2010 to capitalize on similar opportunities 
for a broad range of problems, acting as a forum within which communities could 
form around specific challenges. Kaggle has in turn given rise to other similar venues, 
including Tianchi⁸ and Innocentive.⁹ Prize competitions often operate as a market 
in which sponsors define the task and then pay the prize in exchange for a license to 
commercially use the technique that wins the competition. This stands in sharp con- 
trast to the non-commercial ethos of many of the independent shared task evaluations 
listed at the start of this section, which focus principally on pre-competitive basic 
research. Indeed, some of the independent shared task evaluation venues actively 
seek to minimize the competitive aspect of shared task evaluation, in part because of 
concerns that a “winner-take-all” perspective might depress participation by teams 
who would otherwise be able to contribute diversity to the document pools that will 
be judged for relevance. 


4 https://aclweb.org/aclwiki/SemEval_Portal 
5 http://www.conll.org/previous-tasks 
6 http://cci.drexel.edu/bigdata/bigdata2019/BigDataCupChallenges.html 
7 https://www.kaggle.com/ 
8 https://tianchi.aliyun.com/ 
9 https://www.innocentive.com/ 


14.4 A Brave New World 


In the movie The Wizard of Oz, Dorothy observes at one point that “we’re not in 
Kansas anymore.” So it is with information retrieval evaluation as well—there are 
now many more things under the sun than just shared task evaluation. At least four 
alternatives can be discerned, each of which has its own strengths and weaknesses. 
The first to emerge were project data repositories. Perhaps the best known of these 
is the Linguistic Data Consortium (LDC) at the University of Pennsylvania,¹⁰ which 
was founded in 1992 with support from DARPA to serve as a repository for the human 
language technology community. LDC and similar organizations around the globe 
(e.g., the European Language Resources Association, ELRA,¹¹ or the Linguistic 
Data Consortium for Indian Languages, LDC-IL¹²) permit researchers to deposit test 
collections that they have created that may in the future be of use to others. In this way, 
what were once internal evaluations on data generated within a project can become 
shared, and over time can emerge as a shared task reference to which future work can 
be compared. Perhaps the most successful example of this general approach is the 
University of California Irvine Machine Learning Repository (Dua and Graff 2017), 
which provides test collections that serve as standard references among machine 
learning researchers (notably including some text classification researchers). 
Project data repositories help with community formation and with providing a 
basis for comparisons, but (at least when serving solely as repositories) they do not 
create collections. That’s where crowdsourcing comes in. Shared task evaluations 
in the TREC heritage predate the World Wide Web, but as user-generated content 
became more pervasive in what came to be called Web 2.0, crowdsourcing emerged 
as an alternative way of obtaining relevance judgments (Alonso 2019). Crowdsourc- 
ing can be used in many ways in the evaluation of information retrieval systems, but 
perhaps the most obvious alternative to the approach used in shared task evaluation 
is to simply pay crowdworkers to create relevance judgments. Because queries are 
often treated as independent in information retrieval test collections, the relevance 
judgment task is easily distributable across multiple crowdworkers. At least two 
concerns arise when this is done. First, crowdworkers may be less well trained or 
less attentive to their task than relevance assessors who work at a central facility as 
their primary job would be. This concern has spawned a line of work on assessing 
the accuracy of crowdworkers. Second, one common approach to managing those 
risks, having several crowdworkers vote on the correct relevance label, has the effect 
of subtly redefining relevance (for purposes of evaluation) away from the opinion 
of an individual and toward the consensus of a group. Balanced against these con- 
cerns, however, are the speed, scalability, and relative affordability of crowdsourcing. 
Moreover, the diversity of available crowdworkers can provide access to people with 
needed skills (e.g., language skills or some types of topic expertise) that simply might 
not be available otherwise. For these reasons, crowdsourcing can offer transformational 
advantages to isolated researchers who, for reasons of location, funding, or 
problem specificity simply cannot plausibly create a shared task evaluation. Note, 
however, that crowdsourced test collections need not remain isolated once they have 
been created, since they can be shared through data repositories. 


¹⁰ https://www.ldc.upenn.edu/
¹¹ http://www.elra.info/en/
¹² http://www.ldcil.org/
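
The voting approach to crowdsourced relevance judgments described above can be illustrated with a minimal sketch. The tuple layout, the function names, and the choice to leave ties unresolved are assumptions made for this illustration; they are not drawn from any particular crowdsourcing platform or from the work cited in this chapter.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among the votes, or None if there is no
    clear winner (empty input or a tie for first place)."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: leave the pair unresolved rather than guess
    return counts[0][0]


def aggregate_judgments(judgments):
    """Collapse per-worker labels into one consensus label per (query, document).

    `judgments` is an iterable of (query_id, doc_id, worker_id, label) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    votes_by_pair = {}
    for query_id, doc_id, _worker_id, label in judgments:
        votes_by_pair.setdefault((query_id, doc_id), []).append(label)
    return {pair: majority_label(votes) for pair, votes in votes_by_pair.items()}


if __name__ == "__main__":
    example = [
        ("q1", "d1", "w1", 1), ("q1", "d1", "w2", 1), ("q1", "d1", "w3", 0),
        ("q1", "d2", "w1", 0), ("q1", "d2", "w2", 1),
    ]
    print(aggregate_judgments(example))
```

Note that the consensus label for a pair can differ from what any single assessor would have said on their own, which is the subtle redefinition of relevance mentioned above. An unresolved tie would in practice be sent to an additional worker or an expert assessor.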

Creating test collections is, however, just one of at least two ways in which crowd- 
sourcing can be used for information retrieval evaluation. An alternative is to study 
the actual use of a system using crowdworkers. Test collections have many desirable 
attributes, but no test collection captures every important aspect of actual information 
retrieval tasks. Evaluating information retrieval systems in actual use has tradition- 
ally been a focus of user studies, and crowdsourcing offers an opportunity to extend 
the user study beyond the researcher’s laboratory across the Internet to meet the 
users where they are. This opens new opportunities to intermix research using test 
collections (which are optimized for affordably repeatable evaluation under con- 
trolled conditions) and user studies (which offer higher fidelity evaluation, but at 
incremental cost each time an experiment is run). 

There are, of course, limits to the user studies that can be run with crowdworkers. 
In addition to the obvious limits imposed by affordability considerations, fidelity is 
always a concern when paying a user to perform a task that you have designed. One 
way of addressing both of these concerns is to perform what has come to be called 
online evaluation (Radlinski and Craswell 2010). The basic approach is simple. First, 
build a system that becomes so popular that there will be a large number of users 
whose behavior you can study. Then design experiments in which some aspect of the 
system (the independent variable) is changed, and the effect is measured by observing 
some behavioral signal (the dependent variable). Variants on this idea include A-B 
testing and interleaving. Of course, the first step there—creating systems that have a 
large user population—can be a tad expensive! But once such a system is available, 
a very large number of experiments can be run at low cost. Naturally, this approach 
is popular among commercial services that have a large user base. Batch evaluation 
measures have also been tuned using query logs, thus more closely linking online 
and offline (i.e., batch) evaluation (Ferrante et al. 2014). 
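
As a rough illustration of interleaving, the sketch below follows the general idea of team-draft interleaving: two rankers take turns contributing their best not-yet-shown result to a single list, and clicks are credited to whichever ranker contributed the clicked result. The function names, the tie-breaking coin flip, and the simple click-counting credit rule are assumptions made for this example, not a description of any production system.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings into one list and record which ranker ("A" or "B")
    contributed each document."""
    rng = rng or random.Random()
    interleaved = []
    team = {}  # doc -> "A" or "B"
    picks = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))

    while len(interleaved) < total:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            name = "A" if picks["A"] < picks["B"] else "B"
        else:
            name = "A" if rng.random() < 0.5 else "B"
        doc = next((d for d in sources[name] if d not in team), None)
        if doc is None:
            # This ranker has nothing new to offer; the other one picks instead.
            name = "B" if name == "A" else "A"
            doc = next(d for d in sources[name] if d not in team)
        interleaved.append(doc)
        team[doc] = name
        picks[name] += 1
    return interleaved, team


def credit_clicks(team, clicked_docs):
    """Count clicks per ranker; the ranker with more clicks is preferred."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins


if __name__ == "__main__":
    merged, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
    print(merged, credit_clicks(team, ["d3"]))
```

In an A-B test, by contrast, each user would see results from only one of the two rankers, and a behavioral signal such as click-through rate would be compared between the two user groups.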


14.5 Trendlines 


One thing that should be clear from the story to this point is that independent shared 
task evaluations such as NTCIR are now just one part of an increasingly diverse 
and specialized evaluation ecosystem. But that is just one of many trendlines that 
together will continue to reshape the future of information retrieval evaluation. This 
section reviews several others. 

It is fashionable today in many contexts to remark on convergence. What used 
to be separate devices (e.g., phones, computers, and televisions) now are one. What 
used to be stored on separate media (video, images, documents, datasets) are now 
all stored as digital files. What used to be separate functions (computing and com- 
munication) are now becoming nearly inseparable. All of these are examples of 
convergence. We are seeing examples as well of convergence across fields. Information 
retrieval researchers use speech and language technologies that in an earlier 
time would have been thought of as separate fields. Database researchers work with 
semi-structured data that the information retrieval community would recognize as 
structured documents. Data scientists analyze interaction patterns to help optimize 
the user experience. Interactive information retrieval research draws in equal mea- 
sure on insights from information retrieval and human-computer interaction. Work 
on fairness, accountability, and transparency in machine learning finds application 
in designs of information retrieval systems that are informed as much by social as 
by technical goals. This convergence of disciplines creates new opportunities, but at 
the same time it challenges the notions we have developed over time about what is, 
and what is not, information retrieval. 

If convergence disrupts what it is we think we do, the Internet is perhaps even 
more disruptive because it changes where we can do it. In an earlier era, information 
retrieval research suffered from what we might call the tyranny of geography. There 
were a few places in the world where top flight information retrieval research was 
going on, and it was much easier to get into the field if you could get to one of those 
places. Today, information retrieval is taught in many places, and indeed well over 
half the world’s population has access to free online courses on the topic. Cloud 
computing has gone some distance toward democratizing access to high-end com- 
puting, and the widely available low-end computing infrastructure has capabilities 
that were unavailable anywhere on Earth just a few decades ago. We have by no 
means completely erased the tyranny of geography at this point in history, but it is 
quite clearly on the wane. 

Solving one problem often reveals another, and so it is with the competition for our 
attention. For essentially all of human history, and with rare exception, information 
was scarce and human attention was relatively abundant. No one with an Internet 
connection can fail to notice that the situation today has sharply reversed, and that it 
is information that is abundant, while it is human attention that is now scarce. If we 
view our job as helping to separate the wheat from the chaff, it should be clear that 
this trendline suggests that we’ll have no shortage of important problems to work on. 

Another trendline worthy of remark is that the nature of gatekeeping is shifting. 
Long ago we had to choose between a Web track, a filtering track, an interactive 
track, or whatever other ideas were put forward, because venues like NTCIR simply 
could not do everything. It’s still not possible to do everything, but the emergence 
of options such as crowdsourcing and online evaluation greatly expands the range of 
information retrieval evaluations that can be conducted. That’s not to say that there 
will be no gatekeepers. Peer review, for example, will continue to play some role 
with regard to what gets published. But to the extent that some of the gatekeeping 
can be shifted from before the work is done to after the results become available, 
that could help to enhance the diversity of the research ecosystem. 

One foundational assumption in information retrieval is that information wants to 
be found, and that our job is to find it. That’s actually probably not true for much of 
the information in the world, however. Examples abound of information that should 
not be found. In Europe, the right to be forgotten is a right not to have specific information 
about you found. In many countries with legislation that promotes freedom of 
access to government information, specific exemptions identify types of information 
that should not be disclosed. We have debates today about which types of infor- 
mation governments or commercial entities should be allowed to use, and for what 
purposes. Article 12 of the Universal Declaration of Human Rights declares privacy 
to be a human right, with all of the complexity that operationalizing the meaning of 
such a statement entails. In an earlier era, information retrieval research encountered 
restrictions on access from time to time, and in such cases the response of researchers 
was generally to focus instead on the many cases in which access control was not a 
problem. 

We are perhaps now nearing the limits of that strategy. Consider the fact that almost 
all of the words produced on the planet—probably upward of 99%—are spoken, not 
written. Couple that with the fact that well over half that speech is produced in the 
presence of a networked recording device (e.g., a mobile phone). And couple that with 
the fact that both the speed and accuracy of technology for automatically transcribing 
that speech have improved by leaps and bounds in recent years. At present, we are 
largely disregarding all of that content simply because we have no idea how to protect 
those parts that need to be protected. This has implications for research, of course, 
but it has implications for evaluation design as well. We have grown up in an era 
in which we all learned to respect copyright when dealing with test collections. We 
now need to learn how to deal with sensitive content that will in some cases prevent 
us from distributing test collections. That does not mean that we won’t be able to do 
shared task evaluations, but it does mean that we’ll need to think anew about how 
best to do them. The Netflix Prize, for example, ended because of a privacy lawsuit. 

It has been said that “data is the new oil,” a catchy phrase intended to illustrate that 
there is money to be made. At one time, most information retrieval researchers worked 
in universities. Today, the balance has shifted very strongly in favor of industry. 
That’s good news, because that’s where the money is, so there is now vastly more 
research on information retrieval being published than ever before. It is also good 
news because industry has access to evaluation opportunities that simply can’t be 
replicated elsewhere, most notably with online evaluation. And it is also good news 
because all this commercial activity is helping to bring new problems to the attention 
of the information retrieval research community. 


14.6 An Inconclusion 


It is traditional to end a chapter with a conclusion, but when writing about the future 
perhaps it would be wise to recognize that the evidence we see today is not suf- 
ficiently conclusive to allow us to see that future with clarity. Herewith, therefore, 
some inconclusive remarks. Joseph Schumpeter is best known for his description of 
creative destruction, a process by which innovations result in the displacement of 
earlier enterprises that had been built to leverage earlier innovations (Schumpeter 
1942). As the convergence examples above indicate, creative destruction is at least 
as vibrant today as it was when Schumpeter was writing. Independent shared task 
evaluations such as NTCIR were created in an earlier era, to fill a role that has since 
been augmented, and perhaps partially replaced, by other approaches to information 
retrieval evaluation. It therefore seems timely to consider the question of what role 
NTCIR, and other independent shared task evaluations, may play in the future. For- 
tunately, the very name of NTCIR, the NII Testbeds and Community for Information 
Access Research, can help to guide that discussion. 

N is for NII, the National Institute of Informatics. NII, like NACSIS before it, 
has been a source of leadership, not just in information retrieval evaluation, but 
in the emergence of a vibrant information retrieval research community in Japan 
specifically, and in East Asia more generally. Ultimately, NII is made up of people, 
and it is the choices made by those people that will define the future leadership role 
of that institution. With wise choices, that N will remain a capital letter. 

T is for Testbeds. As explained throughout this chapter, the testbeds of the sort 
NTCIR has created (principally, test collections) are one part of what is now a rich 
ecosystem of evaluation methods. There will surely continue to be demand for test 
collections, but shared task evaluations like NTCIR are no longer the only affordable 
way in which test collections can be created, and we now live in a world in which a 
broader range of testbeds can be affordably constructed. We therefore may see the T 
in NTCIR decline somewhat in its impact, perhaps becoming a lower case t. 

C is for Communities. For all the trendlines that portend change, one thing that 
seems unlikely to change any time soon is human nature. Humans are social animals, 
and research is a social enterprise. We need ways of bringing people together around 
new problems, ways of helping new people to join those communities, ways of 
creating the kinds of shared understanding that are needed to learn from each other 
how best to solve those problems, and ways of defining what it would mean to 
succeed at solving those problems. Shared task evaluations like NTCIR serve all of 
those functions. The C in NTCIR seems destined to remain a capital letter. 

I is for Information access. As noted at the start of this chapter, we live in an information 
age, and it therefore seems unlikely that the focus of NTCIR on information 
would diminish. The same might not be said for access, however, since we are now 
seeing some convergence of research on (at least) information access, information 
creation, information understanding, information manipulation, and information pol- 
icy. So the I in NTCIR seems sure to remain capitalized, but we may see some shifts 
in what it stands for. 

R is for Research. We might think of research in three ways. The most obvious is to 
think narrowly in terms of some specific type of research, such as evaluation-guided 
research or statistical hypothesis testing. An alternative is to think of research more 
inclusively, as any systematic way of generating new and generalizable knowledge. 
And a third alternative would be to think even more broadly about research, as 
an undergraduate student might, as self-directed learning about new things. Many 
people who do not see themselves as researchers in the first or second sense need 
to do research in the third sense. One way or another, the R seems likely to remain 
since it is central to the self-image of NTCIR, but perhaps the meaning of that R will 
shift somewhat over time. 

Well, there we have it. It seems that we can look forward to a world in which 
NtCIR remains, and all we will need to do is to figure out what it actually stands for! 